Front Matter

Paez, Antonio (2021)
An Introduction to Spatial Data Analysis and Statistics: A Course in R
[Publisher]
ISBN: 978-1-7778515-0-7
DOI
In GitHub: https://github.com/paezha/spatial-analysis-r

Antonio Paez
School of Earth, Environment and Society
McMaster University
Hamilton, Ontario
Canada

ORCID: https://orcid.org/0000-0001-6912-9919

What is this book and who is it for?

Words go here.

Allied resources

You can find additional resources for students and instructors here

Contributing

An advantage of an Open Educational Resource compared to traditional publishing (besides it being free!) is that it is a live, ongoing project, for as long as anyone cares to keep it alive. If you are using this resource, I would encourage you to contribute to help me improve it, by:

  • suggesting improvements to the text, e.g. clarifying unclear sentences, fixing typos (see guidance from Yihui Xie);
  • proposing changes to the code, e.g. to do things in a more efficient way.

Doing this is as easy as editing a wiki page; use the Edit button in the toolbar to make a pull request (you need to have a GitHub account and be able to fork the repository).

In addition, please feel free to make requests for features or to develop content (see the project’s issue tracker).

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). This means that you are free to:

  • Share it: you can copy and redistribute the material in any medium or format
  • Adapt it: you can remix, transform, and build upon the material

Under the following terms:

  • Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

  • NonCommercial: You may not use the material for commercial purposes.

  • ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

  • No additional restrictions: You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

These freedoms cannot be revoked by the licensor (that is me) as long as you follow the license terms.

How to support this project

Preface

“Patterns cannot be weighed or measured. Patterns must be mapped.”

— Fritjof Capra, The Web of Life: A New Scientific Understanding of Living Systems

Spatial Analysis and Spatial Statistics

The field of spatial statistics has experienced phenomenal growth in the past 20 years.

From being a niche subdiscipline in quantitative geography, statistics, regional science, and ecology at the beginning of the 1990s, it is now a mainstay in applications in a multitude of fields, including medical imaging, remote sensing, civil engineering, geology, statistics and probability, spatial epidemiology, and ecology, to name just a few disciplines.

The growth in research and applications in spatial statistics has been in good measure fueled by the explosive growth in geotechnologies: technologies for sensing and describing the natural, social, and built environments on Earth. An outcome of this is that spatial data are, to an unprecedented level, within the reach of multitudes. Hardware and software have become cheaper and increasingly powerful, and we have transitioned from a data-poor environment (in all respects, but particularly in terms of spatial data) to a data-rich environment. Twenty years ago, for instance, technical skills in spatial analysis included tasks such as digitizing. In the mid-1990s, as a Master's student, I spent many boring hours digitizing paper maps before I could do any analysis on the single-seat (and relatively expensive) Geographic Information System (GIS) available in my laboratory. In that place at that time I was more or less a geographical freak: although there was an institutional push to adopt GIS, relatively few in my academic environment saw the value of spending hours digitizing maps, something that nowadays would be considered relatively low-level technical work. Surely, the time of a Master's student, let alone a professional researcher or business analyst, is more valuable than that. Indeed, very little time is now spent on such low-level tasks, as data increasingly are collected and disseminated in native digital formats. Instead, there is a huge appetite for what could be called the brainware of spatial analysis, the intelligence counterpart of the hardware, software, and data provided by geotechnologies.

The contribution of brainware to spatial analysis is to make sense of vast amounts of data, in effect transforming them into information. This information in turn can be useful to understand basic scientific questions (e.g., changes in land cover), to support public policy (e.g., what is the value capture of public infrastructure), and to inform business decisions (e.g., what levels of demand can be expected given the distribution of retail outlets). There are numerous forms of spatial analysis, including normative techniques [such as spatial optimization; see Tong and Murray (2012)] and geometric and cartographic analysis [for instance, map algebra; Tomlin (1990)]. Among these, spatial statistics is one of the key elements in the family of toolboxes for spatial analysis.

So what is spatial statistics?

Very quickly, I will define spatial statistics as the application of statistical techniques to data that have geographical references - in other words, as the statistical analysis of maps.

Like statistics more generally, spatial statistics is interested in hypothesis testing and inference. What distinguishes it as a branch of the broader field of statistics is its explicit interest in situations where data are not independent from each other (like throws of fair dice) but rather display systemic associations. These associations, when seen through the lens of cartography, can manifest themselves as patterns of similarities (e.g., birds of a feather flock together) or dissimilarities (e.g., repulsion due to spatial competition among firms), to give two common examples of spatial patterns.

Spatial statistics covers a broad array of techniques for the analysis of spatial patterns, including tools for testing whether patterns are random or not, and a wide variety of modeling approaches as well. These tools enhance the brainware of analysts by allowing them to identify and possibly model patterns for inferring processes and/or for making spatial predictions.

Why this Text?

The objective of this book is to introduce selected topics in applied spatial statistics.

The foundations for the book are the notes that I have developed over several years of teaching applied spatial statistics at McMaster University. This course is a specialist course for senior-level undergraduate geographers and students in other disciplines who are often working towards specializations in GIS.

Over the course of the years, my colleagues at McMaster and I have used at least three different textbooks for teaching spatial statistics. I have personally used the book by McGrew and Monroe (2009) to introduce fundamental statistical concepts to geographers. McGrew and Monroe (currently on a third edition with Lembo) do a fine job of introducing statistics as a tool for decision making, and therefore offer a very valuable resource to learn matters of inference, for instance. Many of the examples in the book are geographical in nature; however, the book is relatively limited in its coverage of spatial statistics (particularly models for spatial processes), which is a limitation for teaching a specialist course on this topic.

My text of choice early on (approximately between 2003 and 2010) was the excellent book Interactive Spatial Data Analysis by Bailey and Gatrell (1995). A notable aspect of Bailey and Gatrell was that the book was accompanied by a software application to implement the techniques it discussed. I started using this book as a graduate student around 1998, but even then the limitations of the software that accompanied the book were apparent - in particular the absence of updates or a central repository for code (the book had a sleeve to store a 3½-inch floppy disk to install the software). Despite the regrettable obsolescence of the software, the book provided then, and still does, a very accessible yet rigorous treatment of many topics of interest in spatial statistics. Bailey and Gatrell’s book was, I believe, the first attempt to bridge, on the one hand, the need to teach mid- and upper-level university courses in spatial statistics, and on the other, the challenges of doing so with the very specialized texts on this topic that existed at the time, including the excellent but demanding Spatial Econometrics (Anselin 1988), Advanced Spatial Statistics (Griffith 1988), Spatial Data Analysis in the Social and Environmental Sciences (Haining 1990), not to mention Statistics for Spatial Data (Cressie 1993).

More recently, as Bailey and Gatrell aged, my book of choice for teaching spatial statistics became O’Sullivan and Unwin’s Geographical Information Analysis (O’Sullivan and Unwin 2010). This book updated a number of topics that were not covered by Bailey and Gatrell. To give one example, much work happened in the mid- to late-1990s with the development of local forms of spatial analysis, including pioneering research by Getis and Ord on concentration statistics (Getis and Ord 1992), Anselin’s Local Indicators of Spatial Association (Anselin 1995), and Brunsdon, Fotheringham, and Charlton’s research on geographically weighted regression (C. Brunsdon, Fotheringham, and Charlton 1996). These and related local forms of spatial analysis have become hugely influential in the intervening years, and are duly covered by O’Sullivan and Unwin in a way that merges well with a course focusing on spatial statistics - although other specialist texts also exist that delve in much more depth into some of these topics (e.g., Fotheringham and Brunsdon 1999; and Lloyd 2010).

These resources, and many more, have proved invaluable for my teaching for the past few years, and I am sure that their influence will be evident in the present book. Other excellent sources are also available, including Applied Spatial Data Analysis in R (Bivand, Pebesma, and Gómez-Rubio 2008), Spatial Data Analysis in Ecology and Agriculture Using R (Plant 2012), An Introduction to R for Spatial Analysis & Mapping (Chris Brunsdon and Comber 2015), Spatial Point Patterns: Methodology and Applications with R (Baddeley, Rubak, and Turner 2016), and Geocomputation with R (Lovelace, Nowosad, and Muenchow 2019). This is in addition to other resources available online, such as M. Gimond’s Intro to GIS and Spatial Analysis and R. Hijmans’s Spatial Data Analysis and Modeling with R.

So, if there are some excellent resources for teaching and learning spatial statistics, why am I moved to unleash on the world yet another text on this topic?

I am convinced that there is richness in variety.

As demand for training in spatial statistics grows, there is potential for different sources to satisfy different needs. Some books are geared towards specialized topics [e.g., point pattern analysis; Baddeley, Rubak, and Turner (2016)] and cover their subject matter in much more depth than I could in an undergraduate course. For this reason, they are more useful as a reference or a tool for learning for researchers and graduate students. Other books focus more heavily on mapping in R than a course on spatial statistics can comfortably accommodate (e.g., Chris Brunsdon and Comber 2015; Lovelace, Nowosad, and Muenchow 2019). And yet other books are geared towards specific disciplines [e.g., ecology and agriculture; Plant (2012)]. Bivand et al. (2008) is an excellent reference. At the time of their writing, much work was devoted to issues of spatial data representation. As a consequence, a good portion of their book is concerned with the critical issue of handling spatial data, including data classes and import/export operations which, while essential, happen for most practitioners at a baser level.

My approach can be seen as complementary to some of the texts above.

I have tried to write a text that introduces key concepts of data handling and mapping in R as they are needed to learn and practice spatial statistical analysis. This I have tried to do as intuitively as I could. Readers will see that the computational part of the book - everything that usually lives “under the hood”, so to speak - is laid bare in the open. The code is extensively documented as it is introduced (with extensive repetition for pedagogical purposes). Once a reader has seen and used some commands, we proceed to introduce more sophisticated computational approaches, which are in turn documented extensively when they first appear. I like to think of this approach as introducing coding by stealth, with a gentle ramp for those students who may not have extensive experience in computer-speak. These computational aspects constitute the “how to” of the book. How to calculate a summary statistic. How to create a plot. How to map a variable. How to estimate a model.

The “how to” is an essential foundation for then exercising the brainware. By introducing the tools needed to accomplish data analysis tasks in a relatively gentle way, I have been able to concentrate on introducing (again, in what I hope is an intuitive way!) key concepts in spatial statistics. The text is not meant to be used as a reference, although some readers may find that it works in that way, in particular with respect to the implementation of techniques. Rather, the text is more suitable to be read linearly - indeed as a course on the topic of spatial statistics. Readers who have familiarized themselves with the text can possibly find it useful as a reference, but I do not recommend using it as a reference in the first place.

Lastly, the focus of the text is on applied spatial statistics. There is, inevitably, a component of math, but I have tried, to the extent of my ability, to make the underlying math as intuitive and accessible as possible. As noted above, there is also an important computational component - in particular, as per the title, using the R statistical language. As McElreath (2016) notes, in addition to the pedagogical value of teaching statistics using a coding approach, much of statistics has in fact become so computational that coding skills are increasingly indispensable. I tend to agree with this, and there are reasons to believe that one of the strengths of this approach as well is to make statistical work as open, clear, and reproducible as possible (see Rey 2009).

Plan

My aim with this book is to introduce key concepts and techniques in the statistical analysis of spatial data in an intuitive way. While there are other resources that offer more advanced treatments of every single one of these topics, this book should be appealing to undergraduate students or others who are approaching the topic for the first time.

The book is organized thematically following the canonical approach seen, for instance, in Bailey and Gatrell (1995), Bivand et al. (2008), and O’Sullivan and Unwin (2010). This approach is to conceptualize data by their unit of support. Accordingly, data are seen as being represented by:

  1. Discrete processes in space (e.g., points and events).

  2. Aggregations into zones for statistical purposes (e.g. demographic variables into census areas).

  3. Discrete measurements in space of an underlying continuous process (e.g. weather stations monitoring temperature).
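
To make the three units of support more concrete, the following is a minimal sketch of how each kind of data might look as a table in R. All object names, column names, and values are hypothetical and chosen only for illustration; they do not come from the datasets used in the book.

```r
# 1. Discrete events: each row is an event, and its location IS the observation
events <- data.frame(
  x = c(2.1, 3.5, 7.8, 4.4),
  y = c(1.0, 4.2, 6.6, 4.9)
)

# 2. Zonal aggregates: each row is a zone, and the value summarizes the whole zone
zones <- data.frame(
  zone_id    = c("A", "B", "C"),
  population = c(1200, 860, 2310)
)

# 3. Samples of a continuous process: each row is a measurement site; the
#    variable also exists between sites, but is only observed at these locations
stations <- data.frame(
  x             = c(1.5, 6.0, 8.2),
  y             = c(2.2, 3.3, 7.1),
  temperature_c = c(18.4, 17.9, 16.2)
)

# The natural first question differs by type of data
n_events   <- nrow(events)                # how many events, and where?
total_pop  <- sum(zones$population)       # how do zone values compare or add up?
mean_temp  <- mean(stations$temperature_c) # what is the value at unsampled places?
```

In the first table the coordinates are the data; in the other two, the coordinates or zone identifiers only say where an attribute was recorded.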

The book is organized in such a way that each chapter covers a topic that builds on previous material. All chapters, starting with Chapter 3, are followed by an activity.

I have used the materials presented in this text (in a variety of incarnations) for teaching spatial data analysis in different settings. Primarily, these notes have been used in the course GEOG 4GA3 Applied Spatial Statistics at McMaster University. This course runs for a full (Canadian) academic term, which typically means 13 weeks of classes. The course is organized as a 2-hour-per-week class, with a GIS-lab component which uses a complementary set of notes. For this reason, each chapter is designed to cover very approximately the material that I am used to covering in a 50-minute lecture in a traditional classroom-lecturing setting. In this case, the activities that accompany each chapter could be assigned as homework, optional materials, or as lab materials. For instructors who do not have a lab component, the activities could easily be adapted as lab exercises.

More recently, I have experimented with delivery of contents in a flipped classroom format (see here for a discussion of flipped classrooms).

Briefly, a flipped classroom changes the role of the instructor and the delivery of contents. In a flipped classroom, the instructor minimizes lecturing time, opting instead for offering study materials in advance (often the materials are online and may have an interactive component). This frees the instructor from the tyranny of lecturing, so that in-class time can be dedicated instead to hands-on activities. The instructor is no longer a magical source of wisdom, but rather a partner in the learning process. Under this scenario, students are responsible for reading the required chapter or chapters in advance of a class. The class then is dedicated to the activity that follows the chapter, with students working individually or in small groups on the activity. I have broken a 50-minute session of this type as follows: 10 minutes for a short mini-lecture and to discuss any questions about the preceding reading/study materials, followed by 30 minutes to complete the activity; during this time I engage individually or in small groups with the students as they work; and before the end of the 50-minute session a 10-minute recap, where I summarize the key aspects of the lesson, clearly identify the threshold concepts covered, and indicate how this relates to the next lesson. Increasingly I see this format as a form of apprenticeship, where the students learn by doing, and see links (which I have yet to explore) to experiential learning.

In addition to the two formats above (traditional classroom-lecture and flipped classroom), I have also used portions of these notes to teach short courses in different places, including at Universidade de Sao Paulo in Brazil, the University of Western Australia, at the Gran Sasso Scientific Institute in Italy, and Universidad Politecnica de Madrid, in Spain, among other places. The materials can, with only relatively minor modifications, be used in this way.

As I continue to work on these notes, I hope to be able to add optional (or bonus) chapters that could be used 1) to extend a course on spatial statistics beyond the 13-week horizon of the Canadian term, and/or 2) to offer more advanced material to interested readers (see here for an example on spatial filtering).

Audience

The notes were designed for a course in geography, but in fact, could be easily adjusted for an audience of earth scientists, environmental scientists, econometricians, planners, or students in other disciplines who have an interest in and work with georeferenced datasets. The prerequisites are an introductory college/university level course on multivariate statistics, ideally covering the fundamentals of probability, hypothesis testing, and multivariate linear regression analysis.

Requisites

To fully benefit from this text, up-to-date copies of R and RStudio are highly recommended. Many examples in the text use datasets that have been packaged for convenience as an R package. To install the package (isdas) use the following commands (which require the package remotes):

library(remotes)
remotes::install_github("paezha/isdas")

The source files for the chapters and activities can be obtained from the following GitHub repository:

https://github.com/paezha/spatial-analysis-R

Words of Appreciation

I would like to express my gratitude to the Paul R. MacPherson Institute for Leadership, Innovation and Excellence in Teaching. The Institute supported, through its Student Partners program, my work with some amazing student partners. As part of this program, I worked with Mr. Rajveer Ubhi in the Fall of 2018 and Winter of 2019 organizing all the materials for the text, documenting the code, and ensuring that it satisfied student needs. I also had the opportunity to work with Ms. Megan Coad and Ms. Alexis Polidoro in the Fall of 2019 and Winter of 2020. As former students of the course, Ms. Coad and Ms. Polidoro helped to develop a set of mini-lectures to accompany the materials, continued to document the code, and tested the activities. In the Winter of 2020 they also accompanied me in the classroom to work directly with new students. Dr. Anastasios Dardas helped develop illustrative applications that helped us understand the value of interactivity in delivering many of the contents.

Working with these wonderful individuals has been a pleasure, and I am grateful for their contributions to this effort.

Versioning

These notes were developed using the following version of R:

##                _                                
## platform       x86_64-w64-mingw32               
## arch           x86_64                           
## os             mingw32                          
## crt            ucrt                             
## system         x86_64, mingw32                  
## status                                          
## major          4                                
## minor          2.1                              
## year           2022                             
## month          06                               
## day            23                               
## svn rev        82513                            
## language       R                                
## version.string R version 4.2.1 (2022-06-23 ucrt)
## nickname       Funny-Looking Kid
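
The listing above has the layout that R produces when the built-in object R.version is printed. As a minimal sketch, the following commands can be used to check which version of R is installed on your own computer and to compare it with the one used for these notes. The comparison against "4.2.1" simply reuses the version string reported above.

```r
# One-line description of the R version in use
R.version.string

# Full listing of version details (platform, major and minor version, and so on)
R.version

# Is the installed version at least as recent as the one used for these notes?
same_or_newer <- getRversion() >= "4.2.1"
same_or_newer
```

A newer version of R will normally run the code in the book without problems, although the output of some commands may differ slightly.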

Part I: Getting to Know the Technology

1 Preliminaries: Installing R and RStudio

1.1 Introduction

Statistical analysis is the study of the properties of a dataset. There are different aspects of statistical analysis, and they often require that we work with data that are messy. According to Wickham and Grolemund (2016), computer-assisted data analysis includes the steps outlined in the figure on the process of doing data analysis, shown below.

First, the data are imported to a suitable software application. This can include data from primary sources (suppose that you collected coordinates using a GPS) or from secondary sources (the Census of Canada). Data will likely be text tables, or an Excel file, among other possible formats. Before data can be analyzed, they need to be tidied. This means that the data need to be arranged in such a way that they match the process that you are interested in. For instance, a travel survey can be organized so that each row is a traveler, or as an alternative so that each row is a trip.
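
The travel survey example can be sketched in a few lines of base R. The table below is hypothetical (the names trips, traveler_id, mode, and distance_km are made up for illustration): it is first organized so that each row is a trip, and then rearranged so that each row is a traveler.

```r
# Hypothetical travel survey, organized so that each row is a trip
trips <- data.frame(
  traveler_id = c(1, 1, 2, 3, 3, 3),
  mode        = c("walk", "bus", "car", "bike", "bike", "walk"),
  distance_km = c(0.8, 5.2, 12.0, 3.1, 2.7, 0.5)
)

# Rearranged so that each row is a traveler: total distance traveled...
travelers <- aggregate(distance_km ~ traveler_id, data = trips, FUN = sum)

# ...and number of trips made (table() counts the rows for each traveler,
# in the same sorted order of traveler_id that aggregate() uses)
travelers$n_trips <- as.vector(table(trips$traveler_id))

travelers
```

Which of the two arrangements counts as tidy depends on the question: an analysis of mode choice needs one row per trip, whereas an analysis of how much each person travels needs one row per traveler.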

Once data are tidy, Exploratory Data Analysis (EDA) and/or its geographical extension Exploratory Spatial Data Analysis (ESDA) can be conducted. This involves transforming the raw data into information. Examples of transformations include calculating the mean and the standard deviation. Visualization is also part of this exploratory exercise. In EDA this could be creating a histogram or a scatterplot. Mapping is a key visualization technique in spatial statistics.
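
As a small illustration of these exploratory steps, the sketch below uses quakes, a dataset of seismic events near Fiji that is distributed with R (it is not one of the datasets of the book, and is used here only because it requires no installation). The last plot is only a crude stand-in for a map, since it ignores map projections.

```r
# quakes ships with R: location (lat, long), depth, and magnitude of seismic events
data(quakes)

# Transformations: summarize the magnitudes with their mean and standard deviation
mag_mean <- mean(quakes$mag)
mag_sd   <- sd(quakes$mag)
c(mean = mag_mean, sd = mag_sd)

# EDA visualization: histogram of magnitudes
hist(quakes$mag, main = "Magnitude of seismic events", xlab = "Magnitude")

# A first step towards ESDA: plot the events by their coordinates,
# with larger symbols for events of larger magnitude
plot(quakes$long, quakes$lat, cex = quakes$mag / 5,
     xlab = "Longitude", ylab = "Latitude")
```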

Modeling is a process that further extracts information from the data, typically by looking at relationships between multiple variables.
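
A minimal sketch of what modeling looks like in practice, again using the quakes dataset that ships with R as a stand-in: a linear regression of the number of stations that reported an event on the magnitude of the event. The choice of variables is only illustrative and is not an analysis taken from the book.

```r
# quakes ships with R; stations is the number of stations reporting each event
data(quakes)

# Fit a linear regression of stations on magnitude
fit <- lm(stations ~ mag, data = quakes)

# Estimated intercept and slope
coef(fit)

# Proportion of the variance of stations accounted for by the model
summary(fit)$r.squared
```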

All of the tasks mentioned above, and many more, can be handled easily in a variety of software applications. For this course, you will use the statistical computing language R.

Figure: The process of doing data analysis (from Wickham and Grolemund, 2016)

1.2 Learning Objectives

In this reading, you will learn:

  1. How to install R.
  2. About the RStudio Integrated Development Environment.
  3. About packages in R.

1.3 R: The Open Statistical Computing Project

1.3.1 What is R?

R is an open-source language for statistical computing. It was created by Ross Ihaka and Robert Gentleman at the University of Auckland, in New Zealand, as a way to offer their students an accessible, no-cost tool for their courses. R is now maintained by the R Development Core Team, and is developed by hundreds of contributors around the globe. R is an attractive alternative to other software applications for data analysis (e.g., Microsoft Excel, STATA) due to its open-source character (i.e., it is free), its flexibility, and large and dedicated user community. The presence of a very active community of developers and users, especially in an open context, means if there is something you want to do (for instance, linear regression), it is very likely that someone has already developed functionality for it in R.

A good way to think about R is as a core package, to which a library, consisting of additional packages, can be attached to increase its functionality. R can be downloaded for free at:

https://cran.rstudio.com/

R comes with a built-in console (a graphical user interface), but better alternatives to the basic interface exist, including RStudio, an Integrated Development Environment, or IDE for short. RStudio can also be downloaded for free, by visiting the website:

https://www.rstudio.com/products/rstudio/download/

R requires you to work using the command line, which is going to be unfamiliar to many of you accustomed to user-friendly graphical interfaces. Do not fear. People worked for a long time using the command line, or even more cumbersomely, using punched cards in early computers. Graphical user interfaces are convenient, but they have a major drawback, namely their inflexibility. A program that functions based on graphical user interfaces allows you to do only what is hard-coded in the user interface. Command line, as we will see, is somewhat more involved, but provides much more flexibility in operation, and it frees you from the constraints inherent in a point-and-click system.

Go ahead. Install R and RStudio on your computer. (If you are at McMaster working in the GIS lab, you will find that these have already been installed there.)

Before introducing some basic functionality in R, let's quickly take a tour of the RStudio IDE.

1.3.2 The RStudio IDE

The RStudio IDE provides a very complete interface to interact with the language R, and do much more in addition. It consists of a window with several panes. Some panes include, in addition, several tabs. There are the usual drop-down menus for common operations, such as creating new files, saving, common commands for editing, etc. See the figure of the RStudio IDE below.

Figure: The RStudio IDE

The editor pane allows you to open and work with text and other files, where you can write instructions that can be passed on to the program. Writing something in the editor does not execute any instructions, it merely records them for possible future use. In fact, much of what is written in the editor will not be instructions, but rather comments, discussion, and other text that is useful to understand code.

The console pane is where instructions are passed on to the program. When an instruction is typed (or copied and pasted) there, R will understand that it needs to do something. The instructions must be written in a way that R understands, otherwise errors will occur. If you have typed instructions in the editor, you can use “ctrl-Enter” (in Windows) or “cmd-Enter” (in Mac) to send them to the console and execute them.

The environment is where all data that is currently in memory is reported. The History tab acts like a log: it keeps track of the instructions that have been executed in the console.

The last pane includes a number of useful tabs. The Files tab allows you to navigate your computer, change the working directory, see what files are where, and so on. The Plots tab is where plots are rendered, when instructions require R to do so. The Packages tab allows you to manage packages, which as mentioned above, are pieces of code that can augment the functionality of R. The Help tab is where you can consult the documentation for functions and packages, see examples, and so on. The Viewer tab is for displaying local web content, for instance, to preview a Notebook (more on Notebooks soon).

This brief introduction should have allowed you to install both R and RStudio. The next thing that you will need is a library of packages.

1.4 Packages in R

According to Wickham (2015) packages are the basic units of reproducible code in the R multiverse. Packages allow a developer to create a self-contained unit of code that often is meant to achieve some task. For instance, there are packages in R that specialize in statistical techniques, such as cluster analysis, visualization, or data manipulation. Some packages can be miscellaneous tools, or contain mostly datasets. Packages are a very convenient way of maintaining code, data, and documentation, and also of sharing all these resources.
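
To see what a package bundles together, the sketch below inspects two packages that are distributed with every installation of R: stats, which specializes in statistical functions, and datasets, which contains mostly data. No installation is needed to run it.

```r
# stats is attached by default; list the functions it makes available
stats_functions <- ls("package:stats")
head(stats_functions)

# sd(), the function for the standard deviation, is one of them
"sd" %in% stats_functions

# datasets is a package that contains mostly data; list the names of its datasets
dataset_names <- data(package = "datasets")$results[, "Item"]
head(dataset_names)

# Every package also carries its own documentation, including a short title
packageDescription("stats")$Title
```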

Packages can be obtained from different sources (including making them!). One of the reasons why R has become so successful is the relative facility with which packages can be distributed. A package that I use frequently is called tidyverse (Wickham 2017). The tidyverse is a collection of functions for data manipulation, analysis, and visualization. This package can be downloaded and installed in your personal library of R packages by using the function install.packages, as follows:

install.packages("tidyverse")

The function install.packages retrieves packages from the Comprehensive R Archive Network, or CRAN for short. CRAN is a collection of sites (accessible via the internet) that carry identical materials for distribution for R.
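
Since a package only needs to be installed once, it can be convenient to check whether it is already in your library before downloading it. The following is a minimal sketch; install_if_missing is a hypothetical helper written for this example and is not part of R or of any package.

```r
# Hypothetical helper: install a package from CRAN only if it is not yet in the library
install_if_missing <- function(pkg) {
  # requireNamespace() returns TRUE if the package can be found, FALSE otherwise
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package is available after the attempt
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# stats ships with R, so this call finds it and downloads nothing
install_if_missing("stats")

# For a package from CRAN the call would be, for example:
# install_if_missing("tidyverse")
```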

There are other ways of distributing packages. For instance, throughout this book you will make use of a package called isdas that contains a collection of datasets and functions used in the readings or activities. This package is not on CRAN, but instead can be obtained from GitHub, a repository and versioning system. To retrieve packages from GitHub you need a function called install_github, which in turn is part of the package remotes.

To download and install the package isdas, you first need to download and install remotes as follows:

install.packages("remotes")

Once a package has been downloaded and installed, it needs to be loaded into a session to be available to use. I find it useful to think of packages that I download as “books” that I place in my personal “bookshelf”. Some “books” I obtain from the central library (i.e., CRAN), while others are shared by friends, and some I have even written myself. Once the “books” are in my “bookshelf” they are part of my own personal library. This means that they are available for use. Next time I want to use a “book” from my library, I need to retrieve it from the bookshelf. This is similar to taking the book and opening it on my desk: now all the magic contained in the package is available for use!

Similarly, once the book is in my library, I do not need to retrieve it again from the bookstore - a package, once installed, does not need to be installed again (it might need updates, but that is a different matter).

This analogy suggests that I can have many packages in my library, only some of which I may need at any specific time for a task. To retrieve a package (i.e., a book) from the library, so that we can use it, the function library is invoked as in this example:

library(remotes)

This allows you to use all the functions in the package remotes. In particular, at this point you want to use a function that allows you to retrieve other packages! With the functionality of remotes::install_github you can download and install the companion package for the book by running the following instruction:

remotes::install_github("paezha/isdas")

This will install the package (i.e., put it in your library) so that you can also benefit from its functionality.
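Since a package only needs to be installed once, it can be handy to check whether it is already in your library before installing it. The following is a minimal sketch of this idea using the base R function requireNamespace(); the helper name is_installed is hypothetical and is not part of R or of the package isdas.

```r
# Hypothetical helper: returns TRUE if a package is already in your library
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# The package `stats` ships with R, so it is always available
is_installed("stats")
## [1] TRUE

# Only remind us to install `remotes` if it is missing from the library
if (!is_installed("remotes")) {
  message("remotes is not installed; run install.packages(\"remotes\")")
}
```

In terms of the analogy, this is like glancing at the bookshelf before making a trip to the library.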

2 Basic Operations and Data Structures in R

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

The preceding chapter showed you how to install R and RStudio, and explained some key concepts, such as packages as fundamental units of reproducible code, and the concept of your library (where the packages that you install are stored).

Now that you have installed R and RStudio we can begin with an overview of basic operations and data structures in this computing language. Please note that this document you are reading, called an R Notebook, is an example of what is called literate programming, a style of document that uses code to illustrate a discussion, as opposed to the traditional programming style that uses natural language to discuss/document the code. By focusing on natural language as opposed to computer-speak, literate programming flips around the usual manner of technical writing to make documents more intuitive and accessible.

Whenever you see a chunk of code in an R Notebook, you can run it (by clicking the ‘play’ icon on the top right corner) to see the results. Try it! You can also copy and paste to your console if you are reading the web version of the book.

print("Hello, Geography 4GA3")
## [1] "Hello, Geography 4GA3"

The chunk of code above instructed R (and through R the computer) to print (or display on the screen) some text.

2.1 Learning Objectives

In this practice, you will learn:

  1. Basic operations in R.
  2. Data classes, data types, and data transformations.
  3. More about the use of packages in R.
  4. Basic visualization.

2.2 RStudio IDE

If you are reading this, you probably already read the introductory chapter that instructed you to install R and RStudio. We can now proceed to discuss some basic concepts of operations and data types.

2.3 Some Basic Operations

R can perform many types of operations. Some simple operations are arithmetic. Others are logical. And so on.

For instance, R can be instructed to conduct sums, as follows:

# `R` understands numbers and arithmetic operators such as `+` for addition
2 + 2
## [1] 4

R can be instructed to do multiplications:

# The sign to instruct `R` to multiply is `*`
2 * 3
## [1] 6

And sequences of operations, possibly using brackets to indicate their order. Compare the following two expressions:

2 * 3 + 5
## [1] 11
2 * (3 + 5)
## [1] 16

Other operations produce logical results (values of true and false):

# Is the statement true?
3 > 2
## [1] TRUE
# Is this true?
3 < 2
## [1] FALSE

And of course, you can combine operations in an expression:

2 * 3 + 5 < 2 * (3 + 5)
## [1] TRUE

As you can see, R can be used as a calculator, but it is much more powerful than that.

We can also create variables. You can think of a variable as a box with a name, whose contents can change. Variables are used to keep track of important stuff in your calculations, and to automate operations. To create a variable, a value is assigned to a name, using this notation <-. You can read this x <- 2 as “assign the value of 2 to a variable called x”. For instance:

# `<-` means "put the value of 2 in the object called `x`"
x <- 2
# `<-` means "put the value of 3 in the object called `y`"
y <- 3
# `<-` means "put the value of 5 in the object called `z`"
z <- 5

Check your “Global Environment”, the tab where the contents of your “Workspace” are displayed for you. You can also simply type the name of the variable in the Console to see its contents. Now that we have some variables with values, we can express operations as follows (same as above):

x * y + z
## [1] 11
x * (y + z)
## [1] 16

However, if we wanted, we could change the values of any of x, y, and/or z and repeat the operations. This allows us to automate some instructions:

x <- 4
x * y + z
## [1] 17

The famous mathematician Henri Poincaré once wrote that “[m]athematics is the art of giving the same name to different things”. Working with a computer language is a lot like that: giving the same name to different values allows us to explore with ease “what would happen if…”. It is a very powerful tool to help us understand the world.
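As a small sketch of this “what would happen if…” idea, the same expression can be repeated for several values of x. The example assumes two things that have not been introduced at this point: the function c(), which collects values and is discussed later in this chapter, and a for loop, which repeats an instruction once for each value.

```r
y <- 3
z <- 5

# Give the name `x` to a different value in each pass of the loop,
# and repeat the same operation every time
for (x in c(2, 4, 6)) {
  print(x * y + z)
}
## [1] 11
## [1] 17
## [1] 23
```

The expression x * y + z is written only once, but it is evaluated with three different values of x.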

2.4 Data Classes in R

As you saw above R can work with different data classes. Some data are numbers. Other data are logical (i.e., take values of TRUE or FALSE). These are some data classes:

  • Numerical
  • Character
  • Logical
  • Factor

The existence of different data classes is very useful, since it allows you to store information in different forms. For instance, you may want to save some text:

name <- "Hamilton"

Or numerical information:

population <- 551751

If you wish to check what class an object is, you can use the function class:

class(name)
## [1] "character"
class(population)
## [1] "numeric"
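The other two classes in the list above can be checked in the same way. This is a brief sketch; the object names is_city and size are made up for the illustration.

```r
# A logical value
is_city <- TRUE
class(is_city)
## [1] "logical"

# A factor: a categorical variable with a fixed set of levels
size <- factor(c("medium", "small", "large"))
class(size)
## [1] "factor"
```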

2.5 Data Types in R

R can work with different data types, including scalars (essentially matrices with only one element), vectors (matrices with one dimension of size 1) and matrices (more generally).

print('This is a scalar')
## [1] "This is a scalar"
1
## [1] 1
print('This is a vector')
## [1] "This is a vector"
# c() is a function to concatenate, that is, to put values in a vector
c(1,2,3,4)
## [1] 1 2 3 4
print('This is a matrix')
## [1] "This is a matrix"
# matrix() creates a two-dimensional array with `nrow` rows, and `ncol` columns
matrix(c(1,2,3,4),nrow = 2, ncol=2)
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

The command c() is used to concatenate the arguments, that is, to join them in a single object. All elements of a vector must be of the same class: all numeric, or all character, or all logical, and so on. A vector cannot hold a mix of data classes; if the arguments are of different classes, R converts them to a common class. The command matrix() creates a matrix with the specified number of rows and columns.
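To see why a vector holds a single class, here is a quick sketch of what happens when values of different classes are passed to c(): rather than keeping the classes separate, R converts all the values to one common class. The object name mixed is only for illustration.

```r
# Numbers, text, and logical values together: everything becomes character
mixed <- c(1, "a", TRUE)
mixed
## [1] "1"    "a"    "TRUE"
class(mixed)
## [1] "character"

# Numbers and logical values together: everything becomes numeric
c(1, TRUE, FALSE)
## [1] 1 1 0
```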

An important data type in R is a data frame. A data frame is a table consisting of rows and columns - commonly a set of vectors that have been collected for convenience. A data frame is used to store data in digital format. (If you have used Excel or another spreadsheet software before, data frames will be familiar to you: they look a lot like a sheet in a spreadsheet.)

A data frame can accommodate large amounts of information (several billion individual items). The data can be numeric, character, logical, and so on. Each grid cell in a data frame has an address that can be identified based on the row and column it belongs to. R can use these addresses to perform mathematical operations. R labels columns by name and rows numerically (or, less commonly, by name).

To illustrate a data frame, let us first create the following vectors, that include names (character class), populations (numeric class), average salaries (numeric class), and coordinates (numeric class) of some cities:

# c() is a function to concatenate, that is, to put values in a vector
Name <- c('Hamilton','Waterloo','Toronto')
Population <- c(551751, 219153, 2731571)
AvgSalary <- c(45692, 57625, 48920)
Latitude <- c(43.255203, 43.4668, 43.6532)
Longitude <- c(-79.843826, -80.51639, -79.3832)

Again, note that <- is an assignment. In other words, it assigns the item on the right to the name on the left.

After you execute the chunk of code above, you will notice that new values appear in your Environment. These are five vectors of size 1:3. You can also see the class of each vector: one is composed of alphanumeric information (chr, for ‘character’) and four are numeric (num).

These vectors can be collected in a dataframe. This is done for convenience, so we know that all these data belong together in some way. Please note that to create a data frame, the vectors must have the same length. In other words, you cannot create a table with elements that have different numbers of rows (other data types allow you to do this, but not data frames).

We will now create a data frame. We will call it “Cities”. There are rules for names (for example, they cannot begin with a number), but in most cases it helps if the names are intuitive and easy to remember. The function used to create a data frame is data.frame() and the arguments are the vectors that we wish to collect there.

Cities <- data.frame(Name, Population, AvgSalary, Latitude, Longitude)

After running the chunk above, now you have a new object in your environment, namely a data frame called Cities.

If you double click on Cities in the Environment tab, you will see that this data frame has five columns (labeled Name, Population, AvgSalary, Latitude, and Longitude), and three rows. You can enter data into a data frame and then use the many built-in functions of R to perform various types of analysis.

At this point, you may notice that Name is stored in the data frame as a character column (chr). In versions of R before 4.0.0, data.frame() converted character vectors to factors by default; in current versions this happens only if you request it (for instance with the argument stringsAsFactors = TRUE, or by using the function factor()). A factor (data class) is a way to store nominal/categorical variables that may have two or more levels. Nominal variables are like labels. In the present case, a factor version of Name would have three levels, corresponding to three cities. If we had information for multiple years, each city might appear more than once, for each year that information was available.
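A factor can also be created explicitly from a character vector with the function factor(). The sketch below uses the same city names as above; Name_factor is a hypothetical object name.

```r
Name <- c('Hamilton', 'Waterloo', 'Toronto')
class(Name)
## [1] "character"

# Convert the character vector to a factor
Name_factor <- factor(Name)
class(Name_factor)
## [1] "factor"

# The levels are the distinct labels, sorted alphabetically by default
levels(Name_factor)
## [1] "Hamilton" "Toronto"  "Waterloo"
```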

2.6 Indexing and Data Transformations

Data frames store information that is related in a compact way.

To perform operations effectively, it is useful to understand the way R locates information in a data frame. As noted before, each grid cell has an address, or in other words an index, that can be referenced in several convenient ways. For instance, assume that you wish to reference the first value of the data frame, that is, row 1 of column Name. To do this, you would use the following instruction:

# To index elements in a data frame we use square brackets `[]` 
# The first number in the square bracket is the row, and the second number 
# (separated by a comma) is the column 
Cities[1,1]
## [1] "Hamilton"

This will recall the element in the first row and first column of Cities. (If Name were stored as a factor, R would also list the levels of the variable.)

As an alternative, you could type:

Cities$Name[1]
## [1] "Hamilton"

As you see, this has the same effect. The dollar sign $ is used to reference columns in a data frame. Therefore, R will call the first element of Name in data frame Cities.

Cities[2,1] is identical to Cities$Name[2]. Try changing the code in the chunk and executing. If you type Cities$Name, R will recall the full column.
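The following sketch puts the two ways of indexing side by side. To keep the example self-contained without overwriting Cities, it uses a smaller stand-in data frame with the hypothetical name Cities_demo, and it assumes R 4.0.0 or later, so that the names are stored as characters.

```r
# Stand-in data frame with only two of the columns of `Cities`
Cities_demo <- data.frame(
  Name = c('Hamilton', 'Waterloo', 'Toronto'),
  Population = c(551751, 219153, 2731571)
)

# Row 2, column 1: the name of the second city
Cities_demo[2, 1]
## [1] "Waterloo"

# The same element, selecting the column by name with `$`
Cities_demo$Name[2]
## [1] "Waterloo"

# Leaving the row index empty returns the whole column
Cities_demo[, "Population"]
## [1]  551751  219153 2731571
```

The row always comes first inside the square brackets, and the column second.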

Indexing is useful to conduct operations. Suppose for instance, that you wished to calculate the total population of two cities, say Hamilton and Waterloo. You can execute the following instructions:

# The dollar sign `$` is used to make reference to a column in the data frame. 
# The square brackets index the row in the column.
Cities$Population[1] + Cities$Population[2]
## [1] 770904

(More involved indexing is also possible, for example, if we use logical operators. Do not worry too much about the details at this point, just verify that the results are identical)

# The indexing now is a logical statement. The double equal sign `==` is used 
# to make logical comparisons. `R` will find the rows for which 
# `Cities$Name=='Hamilton'` is true in the first element of the sum, and the 
# rows for which `Cities$Name=='Waterloo'` is true in the second element of the sum.
Cities$Population[Cities$Name=='Hamilton'] + Cities$Population[Cities$Name=='Waterloo']
## [1] 770904

Suppose that you wanted to calculate the total population of the cities in your data frame. To do this, you would use the function sum():

# `sum()` is a function to add all elements in a numerical vector.
# This could be a column in a data frame
sum(Cities$Population)
## [1] 3502475

You have already seen how R allows you to store in memory the results of some instruction, by means of an assignment <-. You can also perform many other useful operations. For instance, you can calculate the maximum value for a set of values:

# `max()` finds the maximum value in a numerical vector
max(Cities$Population)
## [1] 2731571

And, if you wanted to find which city is the one with the largest population, you would use a logical statement as an index:

# `R` will find all rows for which the statement 
# `Cities$Population==max(Cities$Population)` is true, that is, 
# all the rows with a population identical to the maximum population!  
Cities$Name[Cities$Population==max(Cities$Population)]
## [1] "Toronto"

As you see, Toronto is the largest city (by population) in this dataset. Using indexing in imaginative ways provides a way to do fairly sophisticated data analysis.
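It may help to look at the logical index on its own. The sketch below works directly with the vectors Name and Population (same values as before) instead of the data frame, and stores the result of the comparison in an object with the hypothetical name is_largest.

```r
Name <- c('Hamilton', 'Waterloo', 'Toronto')
Population <- c(551751, 219153, 2731571)

# The comparison returns one logical value per element
is_largest <- Population == max(Population)
is_largest
## [1] FALSE FALSE  TRUE

# Only the elements for which the index is TRUE are kept
Name[is_largest]
## [1] "Toronto"
```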

Likewise, the function for finding the minimum value for a set of values is min():

# `min()` finds the minimum value in a numerical vector
min(Cities$Population)
## [1] 219153

Try calculating the average of the population of the cities, using the command mean(). Use the empty chunk below for this (the result should be 1167492), or do this in your console in RStudio:

Finding the maximum and minimum, aggregating (calculating the sum of a series of values), and finding the average are examples of transformations applied to the data. They give insights into aspects of the dataset that are not necessarily evident from the raw data, especially if the number of observations (or cases) is large. Imagine trying to visually scan a spreadsheet with ten thousand observations to find the maximum value stored there!

2.7 Visualization

The data frame, in essence a table, informative as it is, is not usually the best way to learn from the data. Transformations (or descriptive statistics as discussed above) are helpful to understand important properties of a dataset. In addition, visualization is often a valuable complement to data analysis. Say, we might be interested in finding which city has the largest population and which city has the smallest population in a dataset. We could achieve this by using similar instructions as before, for example:

# `paste()` converts its arguments to character strings and joins them 
# into a single string. We use it here to combine a fixed message with 
# the name of the city selected from `Name` in the data frame `Cities`
paste('The city with the largest population is',
      Cities$Name[Cities$Population==max(Cities$Population)])
## [1] "The city with the largest population is Toronto"
paste('The city with the smallest population is', 
      Cities$Name[Cities$Population==min(Cities$Population)])
## [1] "The city with the smallest population is Waterloo"

Another way of understanding these data, perhaps a more convenient one, is by visualizing them, using for instance a bar chart.

We will proceed to create a bar chart, using a package called ggplot2. This package implements a grammar of graphics, and is a very flexible way of creating plots in R. Since ggplot2 is a package, we first must ensure that it is installed. You can install it using the command install.packages() as follows:

# Once you have installed a package, it does not need to be installed again! 
# It already is in your library and you only need to load it with `library()`
install.packages("ggplot2")

As an alternative to the install.packages() function, you can use the Packages tab in RStudio. Simply navigate to the tab, click install, and select ggplot2 from there. Note that you need to install the package only once! Essentially, installing a package adds it to your library of packages, where it will remain available.

Once the package is installed, it becomes available, but to use it you must load it in memory (similar to opening a “book” on your desktop as you work). For this, we use the command library(), which is used to load a package, that is, to activate it for use.

Assuming that you already have installed ggplot2, we proceed to load it:

library(ggplot2)

Now all commands from the ggplot2 package are available to you.

The package ggplot2 works by layering a series of objects, beginning with a blank plot, to which we can add things. The command to create a plot is ggplot(). This command accepts different arguments. For instance, we can pass data to it in the form of a data frame. We can also indicate different aesthetic values, that is, the things that we wish to plot. None of this is plotted, though, until we indicate which kind of geom or geometric object we wish to plot.

For a bar chart, we would use the following instructions:

# The function `ggplot()` creates an object for plotting, 
# using a data frame as indicated by the input argument `data =`. 
# Furthermore, we can specify how to map elements in the data frame 
# to things in the plot. In this example, we wish to map the names 
# of cities to the x-axis of the plot, and the population to the y-axis 
# of the plot. Accordingly, we define as aesthetic values `aes()` `x = Name` 
# and `y = Population`. The geometric object that we wish to plot is bars, 
# so we use `geom_bar()` with the argument `stat = "identity"` so the data 
# are not transformed before plotting:
ggplot(data = Cities, 
       aes(x = Name, y = Population)) + 
  geom_bar(stat = "identity")

Since this is the first time that we use ggplot(), it is informative to break down these instructions. We are asking ggplot2 to create a plot that will use the data frame Cities. Furthermore, we tell it to use the values of Name in the x-axis, and the values of Population in the y-axis. Run the following chunk:

ggplot(data = Cities, 
       aes(x = Name, y = Population))

Notice how ggplot2 creates a blank plot, and it has yet to actually render any of the population information in there. We layer elements on a plot by using the + sign. It is only when we tell the package to add some geometric element that it renders something on the plot. In the previous case, we told ggplot2 to draw bars (by using the geom_bar() function). The argument of geom_bar was stat = 'identity', to indicate that the data for the y-axis was to be used ‘as-is’ without further statistical transformations.
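As a side note, ggplot2 also provides geom_col(), which draws bars whose heights are taken directly from the data. The sketch below should produce the same bar chart as before; it assumes that ggplot2 is installed, and it uses a stand-in data frame with the hypothetical name Cities_demo so that the example runs on its own.

```r
library(ggplot2)

# Stand-in copy of the two columns needed for the bar chart
Cities_demo <- data.frame(
  Name = c('Hamilton', 'Waterloo', 'Toronto'),
  Population = c(551751, 219153, 2731571)
)

# `geom_col()` uses the values of `y` 'as-is' for the height of the bars,
# so `stat = "identity"` does not need to be spelled out
p_bar <- ggplot(data = Cities_demo, 
                aes(x = Name, y = Population)) + 
  geom_col()
p_bar
```

Which of the two forms to use is a matter of taste; geom_bar() without stat = "identity" counts the number of rows in each category instead.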

There are many different geoms that can be used in ggplot2. You can always consult the help/tutorial files by typing ??ggplot2 in the console. See:

??ggplot2

2.8 Creating a Simple Map

We will see how maps are used in spatial statistical analysis. The simplest one that can be created is a so-called dot map. A dot map simply displays the locations of events of interest, as points. A dot map is, in fact, simply a scatterplot of the coordinates of events. We can use ggplot2 to create a simple dot map of the cities in our sample dataset. For this, we create a ggplot2 object, and for the x and y aesthetics we use the coordinates. The geometric element that we want to render is a point:

# The longitude is mapped to the x-axis of the plot and the latitude is mapped
# to the y-axis of the plot. The function `geom_point()` is used to draw points:
ggplot(data = Cities,
       aes(x = Longitude, y = Latitude)) + 
  geom_point()

This is a simple dot map that simply shows the locations of the cities. We can add labels by means of the geometric element text:

# `geom_text()` is used to write text on the plot, 
# still using the longitude and latitude information:
ggplot(data = Cities, 
       aes(x = Longitude, y = Latitude)) + 
  geom_point() + 
  geom_text(aes(label = Name))

The dot map above tells us the location of the cities in our dataframe and their name. We can include more information in the plot in different ways. For example, a proportional symbol map changes the size of the symbols (the points) to add information to the plot. To create a proportional symbol map, we add to the aesthetics the instruction to use some variable for the size of the symbols:

# The `size` of the points will be proportional to the
# `Population` values in the data frame
ggplot(data = Cities, 
       aes(x = Longitude, y = Latitude)) + 
  geom_point(aes(size = Population)) + 
  geom_text(aes(label = Name))

Furthermore, we can fix the position of the labels by adding a vertical justification to the text (vjust), and to prevent the text from being cut off we can also expand the limits of the plot (expand_limits()):

ggplot(data = Cities, 
       aes(x = Longitude, y = Latitude)) + 
  geom_point(aes(size = Population)) + 
  geom_text(aes(label = Name), 
            vjust = 2) + 
  expand_limits(x = c(-80.7, -79.2), 
                y = c(43.2, 43.7))

The example above has guided you in the creation of a relatively simple proportional symbols map! You can see that creating a plot is simply a matter of instructing R (through ggplot2) to complete a series of instructions:

  • Create a ggplot2 object using a dataset, which will render stuff at locations given by variable1 and variable2: ggplot(data = dataset, aes(x = variable1, y = variable2))

  • Add stuff to the plot. For instance, to add points use geom_point, to add lines use geom_line, and so on.

Check the ggplot2 Cheat Sheet for more information on how to use this package.
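The two steps listed above can be sketched by storing the ggplot2 object under a name and then adding layers to it. The example assumes that ggplot2 is installed; Cities_demo and p are hypothetical names, and the coordinates are the ones used earlier in this chapter.

```r
library(ggplot2)

# Stand-in data frame with the coordinates of the three cities
Cities_demo <- data.frame(
  Name = c('Hamilton', 'Waterloo', 'Toronto'),
  Latitude = c(43.255203, 43.4668, 43.6532),
  Longitude = c(-79.843826, -80.51639, -79.3832)
)

# Step 1: create the ggplot2 object and store it; no data are rendered yet
p <- ggplot(data = Cities_demo, 
            aes(x = Longitude, y = Latitude))

# Step 2: add stuff to the plot with `+`; each instruction below
# renders a different plot from the same starting object
p + geom_point()
p_labelled <- p + geom_point() + geom_text(aes(label = Name), vjust = 2)
p_labelled
```

Storing the first step in p means that the data and aesthetics only need to be written once, no matter how many variations of the plot are tried.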

2.9 Examples of digital cartography in R

A last note. Many other visualization alternatives (for instance, Excel) provide point-and-click functions for creating plots. In contrast, working in R requires plots to be created by meticulously instructing the package what to do. While this is more laborious, it also means that you have complete control over the creation of plots, which in turn allows you to create more flexible and inventive visuals. Below are some of the figures that I have created using R in recent years, including diagrams, thematic maps, and raster data.

Example of visualization: diagram of catchment areas for accessibility analysis (from Paez, Higgins, and Vivona (2018), https://doi.org/10.1371/journal.pone.0218773)

Example of visualization: accessibility to family doctors in Hamilton (from Paez, Higgins, and Vivona (2018), https://doi.org/10.1371/journal.pone.0218773)

Example of visualization: water sources (triangles) and households (circles) in a region in central Kenya (from Paez et al. (2020), https://doi.org/10.1016/j.jtrangeo.2019.102564)

And, these are some figures created using R by talented people around the world.

Example of visualization: Historical map with shading

Example of visualization: Population density of Madagascar

Example of visualization: Street map of Kyoto

Example of visualization: Median household income in California

This concludes your overview of basic operations and data structures in R. You will have an opportunity to learn more about creating maps in R with your reading.

2.10 References

Part II: Statistics and Maps

3 Introduction to Mapping in R

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

Spatial statistics is a sub-field of spatial analysis that has grown in relevance in recent years as a result of 1) the availability of information that is geo-coded, in other words, that has geographical references; and 2) the availability of software to analyze such information.

A key technology fueling this trend is that of Geographical Information Systems (GIS). GIS are, in simplest terms, digital mapping for the 21st century. In most cases, however, GIS go beyond cartographic functions to also enable and enhance our ability to analyze data.

There are many available packages for geographical information analysis. Some are very user friendly, and widely available in many institutional contexts, such as ESRI’s Arc software. Others are fairly specialized, such as Caliper’s TransCAD, which implements many operations of interest for transportation engineering and planning.

Other packages have the advantage of being more flexible and/or free.

Such is the case of the R statistical computing language. R has been adopted by many in the spatial analysis community, and a number of specialized libraries have been developed to support mapping and spatial data analysis functions.

The objective of this note is to provide an introduction to mapping in R. Maps are one of the fundamental tools of spatial statistics and spatial analysis, and R allows for many GIS-like functions.

In the previous reading/practice you created a simple proportional symbols map. In this reading/practice you will learn how to create more sophisticated maps in R.

3.1 Learning Objectives

In this reading, you will:

  1. Revisit how to install and load a package.
  2. Learn how to invoke a dataset and view the data structure.
  3. Learn how to easily create maps using R.
  4. Think about how statistical maps help us understand patterns.

3.2 Suggested Readings

  • Bivand RS, Pebesma E, Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapters 2-3. Springer: New York
  • Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 3. Sage: Los Angeles

3.3 Preliminaries

It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

# The function `ls()` lists all objects in the Environment, that is, your current workspace;
# `rm()` removes all objects listed in the argument `list = `
rm(list = ls())
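If you only want to discard some objects and keep the rest, rm() also accepts the names of individual objects. A minimal sketch, with two made-up objects:

```r
# Create two objects with hypothetical names
scratch_a <- 1
scratch_b <- 2

# Remove only `scratch_a`; `scratch_b` is left untouched
rm(scratch_a)

# `exists()` checks whether an object with a given name can be found
exists("scratch_a")
## [1] FALSE
exists("scratch_b")
## [1] TRUE
```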

3.4 Packages

According to Wickham (2015) packages are the basic units of reproducible code in the R multiverse.

Now that your workspace is clear, you can proceed to load a package. In this case, the package is the one used for this book/course, called isdas:

# The function `library()` is used to load the package we want to work with. 
# In this case, it is the package isdas 
library(isdas)
## Warning: replacing previous import 'plotly::filter' by 'stats::filter' when loading 'isdas'
## Warning: replacing previous import 'dplyr::lag' by 'stats::lag' when loading 'isdas'

The package includes a few datasets that will be used throughout the book:

# The function `data()` loads datasets that are included in a loaded package. 
# In this case, we load `snow_deaths` and `snow_pumps`
data("snow_deaths")
data("snow_pumps")

3.5 Exploring Dataframes and a Simple Proportional Symbols Map

If you correctly loaded the library, you can now access the dataframes in the package isdas. For this section, you will need two dataframes, namely snow_pumps and snow_deaths:

#The function `head` will display the first few rows of the dataframe, snow_deaths 
head(snow_deaths)
##         long      lat Id Count
## 0 -0.1379301 51.51342  1     3
## 1 -0.1378831 51.51336  2     2
## 2 -0.1378529 51.51332  3     1
## 3 -0.1378120 51.51326  4     1
## 4 -0.1377668 51.51320  5     4
## 5 -0.1375369 51.51318  6     2

These data are from the famous London cholera example by John Snow (not the one from Game of Thrones, but the British physician). John Snow is considered the father of spatial epidemiology, and his study mapping the outbreak is credited with helping find its cause. The study investigated the cholera outbreak of 1854 in Soho, London.

The dataframe snow_deaths includes the geocoded addresses of cholera deaths in long and lat, and the number of deaths (the Count) recorded at each address, as well as unique identifiers for the addresses (Id).
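The structure just described can be explored with a few base R functions. So that the sketch runs without the package isdas, it uses a stand-in data frame with the hypothetical name deaths_demo that contains only the six rows displayed by head() above; with the package loaded, the same functions can be applied to snow_deaths itself.

```r
# Stand-in for `snow_deaths`: only the six rows shown by `head()` above
deaths_demo <- data.frame(
  long = c(-0.1379301, -0.1378831, -0.1378529, 
           -0.1378120, -0.1377668, -0.1375369),
  lat = c(51.51342, 51.51336, 51.51332, 51.51326, 51.51320, 51.51318),
  Id = 1:6,
  Count = c(3, 2, 1, 1, 4, 2)
)

# Number of addresses (rows) and number of variables (columns)
dim(deaths_demo)
## [1] 6 4

# Total number of deaths recorded at these six addresses
sum(deaths_demo$Count)
## [1] 13

# Identifier of the address with the most deaths
deaths_demo$Id[which.max(deaths_demo$Count)]
## [1] 5
```

Note that the number of rows counts addresses, while the sum of Count gives the number of deaths.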

A second dataframe snow_pumps includes the geocoded locations of water pumps in Soho:

head(snow_pumps)
##            long      lat  Id Count
## 01   -0.1366679 51.51334 251     1
## 1100 -0.1395862 51.51388 252     1
## 250  -0.1396710 51.51491 253     1
## 310  -0.1316299 51.51235 254     1
## 410  -0.1335944 51.51214 255     1
## 510  -0.1359191 51.51154 256     1

As in your previous reading, it is possible to map the cases using ggplot2. Begin by loading the package tidyverse:

#'Tidyverse' is a collection of R packages designed for data science used in everyday data analyses  
library(tidyverse)
## Warning: package 'stringr' was built under R version 4.2.2

Now, you can create a blank ggplot2 object from which you can render the points for deaths and the pumps.

# The function `ggplot()` is used for data visualization - it creates a plot. 
# The function `geom_point()` tells R you want to draw points. 
# `data = snow_deaths` tells R you want to use the `snow_deaths` dataframe. 
# `aes()` defines the aesthetics of the plot: `x = long` maps `long` to the 
# x axis and `y = lat` maps `lat` to the y axis. Outside of `aes()`, 
# `color = "blue"` colours the points blue and `shape = 16` sets the shape of 
# the points - in this case, `16` are circles and `17` are triangles  

ggplot() +
  geom_point(data = snow_deaths, aes(x = long, y = lat), color = "blue", shape = 16) +
  geom_point(data = snow_pumps, aes(x = long, y = lat), color = "black", shape = 17)

This map is a decent example of how to represent visually some contents in the dataframe. Here, information is displayed using different colours and symbols to represent pumps and deaths from the London Cholera Example. Though this map provides useful insights, it is not of the greatest quality. We will illustrate other ways of creating maps below, including interactive maps.
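One way to turn the dot map of deaths into a proportional symbols map, as in the previous chapter, is to map the variable Count to the size of the points. The sketch below assumes that ggplot2 is installed and, to remain self-contained, uses a stand-in data frame (hypothetical name deaths_demo) with only the six rows of snow_deaths displayed earlier; with the package isdas loaded, snow_deaths can be used in its place.

```r
library(ggplot2)

# Stand-in for `snow_deaths`: only the six rows shown by `head()` earlier
deaths_demo <- data.frame(
  long = c(-0.1379301, -0.1378831, -0.1378529, 
           -0.1378120, -0.1377668, -0.1375369),
  lat = c(51.51342, 51.51336, 51.51332, 51.51326, 51.51320, 51.51318),
  Count = c(3, 2, 1, 1, 4, 2)
)

# `size = Count` goes inside `aes()` because it is mapped to a variable;
# `color` and `shape` stay outside because they are fixed values
p_deaths <- ggplot() + 
  geom_point(data = deaths_demo, 
             aes(x = long, y = lat, size = Count), 
             color = "blue", shape = 16)
p_deaths
```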

3.6 Improving on the Proportional Symbols Map

A package that extends the functionality of mapping in R is leaflet. A key feature of the leaflet package is the ability to make maps interactive for the user. We will see next how to enhance our proportional symbol map using this package. First you need to load the package (you need to install it first if you have not already):

# 'Leaflet' is a package used for visualizing data on a map in R. 
if (!require("leaflet")) install.packages('leaflet')
# 'Magrittr' is a package that provides the pipe operator `%>%` 
if (!require("magrittr")) install.packages('magrittr')
library(leaflet)
library(magrittr)

The first step is to create a leaflet object, which will be saved in m:

# Here, we create a `leaflet` object and assign it to the variable `m`. 
# The `setView` function sets the view of the map where `lng = -0.136` 
# sets the longitude, `lat = 51.513` sets the latitude and the map zoom 
# is set to 16. The `%>%` is a pipe operator that passes the output from 
# the left hand side of the operator to the first argument of the right 
# hand side of the operator. In this case we are telling `R` that we want 
# to center the map on the set longitude and latitude, with a zoom level 
# of 16, which corresponds roughly to a neighborhood 

m <- leaflet(data = snow_deaths) %>% 
  setView(lng = -0.136, lat = 51.513, zoom = 16)

At this point, the map looks like this:

m

The map is empty! This is because we have not yet added any geographical information to plot. We can begin by adding a basemap as follows:

# We are adding a basemap or background map of the study area 
# by means of the `addTiles` function to the 'm' variable 
m <- m %>% addTiles()
m

The map now shows the neighborhood in Soho where the cholera outbreak happened. Now, at long last, we can add the cases of cholera deaths to the map. For this, we indicate the coordinates (preceded by ~), and set an option for clustering by means of the clusterOptions in the following fashion:

# We are adding the cholera deaths to the map and assigning them to the 
# group "Deaths". The '~' symbol tells leaflet to look up `long` and `lat` 
# as columns of the data passed to `leaflet()` (here, `snow_deaths`), and 
# 'clusterOptions = markerClusterOptions()' clusters a large number of 
# markers on the map - in this case it is clustering the deaths into icons 
# with numbers  

m <- m %>% 
  addMarkers(~long, 
             ~lat, 
             clusterOptions = markerClusterOptions(),
             group = "Deaths")
m

The map now displays the locations of cholera deaths on the map. If you zoom in, the clusters will rearrange accordingly. Try it! The other information that we have available is the location of the water pumps, which we can add to the map above (notice that the Broad Street Pump is already shown in the basemap!):

m %>% 
  addMarkers(data = snow_pumps, 
             ~long, 
             ~lat, 
             group = "Pumps")
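
As an aside on the ~ notation used above, the following is a minimal sketch of the idea in base R. It assumes a made-up dataframe, toy_pumps, with illustrative coordinates (not the actual pump locations), and it only mimics conceptually what leaflet does with a formula; it is not leaflet's own code.

```r
# Hypothetical toy data standing in for a table of locations (made-up values)
toy_pumps <- data.frame(long = c(-0.1366, -0.1340),
                        lat = c(51.5133, 51.5120))

# A one-sided formula stores an unevaluated expression
f <- ~long
class(f)

# Evaluating the right-hand side of the formula inside the dataframe
# retrieves the column of that name
lng_values <- eval(f[[2]], envir = toy_pumps)
lng_values
identical(lng_values, toy_pumps$long)
```

In other words, writing ~long is a compact way of saying "the column long of whatever data this layer is using", which is why the same ~long and ~lat can refer to snow_deaths in one layer and to snow_pumps in another.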

An alternative and quicker way to build the same map is to chain all of the steps in a single sequence of pipe operators (%>%), instead of updating m one step at a time. Pipes make code faster to write, easier to read, and more intuitive! Recall that a pipe operator will take the output of the preceding function, and pass it on as the first argument of the next:

m_test <- leaflet() %>% 
  setView(lng = -0.136, lat = 51.513, zoom = 16) %>% 
  addTiles() %>% 
  addMarkers(data = snow_deaths, 
             ~long, 
             ~lat, 
             clusterOptions = markerClusterOptions(), 
             group = "Deaths") %>% 
  addMarkers(data = snow_pumps, 
             ~long, 
             ~lat, 
             group = "Pumps")
m_test

The above results in a much nicer map. Is this map informative? What does it tell you about the incidence of cholera and the location of the pumps?

3.7 Some Simple Spatial Analysis

Despite the simplicity of this map, we can begin to do some spatial analysis. For instance, we could create a heatmap. You have probably seen heatmaps in many different situations before, as they are a popular visualization tool. Heatmaps are created based on a spatial analytical technique called kernel analysis. We will cover this technique in more detail later on. For the time being, it can be illustrated by taking advantage of the leaflet.extras package, which contains a heatmap function. Load the package as follows:

if (!require("leaflet.extras")) install.packages('leaflet.extras')
## Warning: package 'leaflet.extras' was built under R version 4.2.2
library(leaflet.extras)
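
Before building the heatmap, the idea behind kernel analysis can be sketched outside of leaflet. The example below is only an illustration under stated assumptions: it uses simulated points (not the cholera data) and the function kde2d from package MASS, which is a different implementation from the one behind addHeatmap.

```r
library(MASS)  # ships with standard R distributions; provides kde2d()

set.seed(1)
# Simulated points (not the cholera data): one tight cluster plus scattered noise
x <- c(rnorm(50, mean = 0.3, sd = 0.05), runif(20))
y <- c(rnorm(50, mean = 0.6, sd = 0.05), runif(20))

# Estimate a smooth density surface on a 50 x 50 grid over the unit square
dens <- kde2d(x, y, n = 50, lims = c(0, 1, 0, 1))

# Locate the grid cell with the highest estimated density
peak <- which(dens$z == max(dens$z), arr.ind = TRUE)
c(x = dens$x[peak[1, 1]], y = dens$y[peak[1, 2]])

# Draw the surface; the colour scale tracks the estimated density
image(dens, xlab = "x", ylab = "y")
points(x, y, pch = 16, cex = 0.5)
```

The surface is highest where many points fall close together, which is the pattern that a heatmap conveys with colour.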

Next, create a second leaflet object for this example, and call it m2. Notice that we are using the same setView parameters:

m2 <- leaflet(data = snow_deaths) %>% 
  setView(lng = -0.136, 
          lat = 51.513, 
          zoom = 16) %>% 
  addTiles()

Then, add the heatmap. The function used to do this is addHeatmap. We specify the coordinates and the variable for the intensity (here Count, since each case in the dataframe represents Count deaths at the address). Two parameters are important here, the blur and the radius. If you are working with the R notebook version of the book, experiment with changing these parameters:

# The 'addHeatmap' function is making a heat map. We specify the coordinates, 
# same as the block of code above. The 'intensity' argument sets the numeric value 
# (weight) of each point, 'blur' specifies the amount of blur to apply and 'radius' 
# sets the radius of each point on the heatmap 
m2 %>% 
  addHeatmap(lng = ~long, 
             lat = ~lat, 
             intensity = ~Count,
             blur = 40, 
             max = 1, 
             radius = 25)

Lastly, you can also add markers for the pumps as follows:

m2 %>% addHeatmap(lng = ~long, 
                  lat = ~lat, 
                  intensity = ~Count,
                  blur = 40, 
                  max = 1, 
                  radius = 25) %>%
  addMarkers(data = snow_pumps, 
             ~long, 
             ~lat, 
             group = "Pumps")

And everything together:

m2_test <- leaflet(data = snow_deaths) %>%
  setView(lng = -0.136, 
          lat = 51.513, 
          zoom = 16) %>% 
  addTiles() %>% 
  addHeatmap(lng = ~long,
             lat = ~lat, 
             intensity = ~Count,
             blur = 40, 
             max = 1, 
             radius = 25) %>% 
  addMarkers(data = snow_deaths, 
             ~long, 
             ~lat, 
             clusterOptions = markerClusterOptions(), 
             group = "Deaths") %>% 
  addMarkers(data = snow_pumps,
             ~long, 
             ~lat, 
             group = "Pumps")
m2_test

A heatmap (essentially a kernel density of spatial points; more on this in a later chapter) makes it very clear that most cases of cholera happened in the neighborhood of one (possibly contaminated) water pump! At the time, Snow noted with respect to this geographical pattern that:

“It will be observed that the deaths either very much diminished, or ceased altogether, at every point where it becomes decidedly nearer to send to another pump than to the one in Broad street. It may also be noticed that the deaths are most numerous near to the pump where the water could be more readily obtained.”

Snow’s analysis helped to convince officials to close the pump, after which the cholera outbreak subsided. This illustrates how even some relatively simple spatial analysis can help to inform public policy and even save lives. You can read more about this case here.

In this practice you have learned how to implement some simple mapping and spatial statistical analysis using R. In future readings we will further explore the potential of R for both.

3.8 Other Resources

For more information on the functionality of leaflet, please check Leaflet for R

4 Activity: Statistical Maps I

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

4.1 Housekeeping Questions

Answer the following questions:

  1. What are the office hours of your instructor this term?

  2. How are assignments graded?

  3. What is the policy for late assignments in this course?

4.2 Learning Objectives

In this activity you will:

  1. Discuss statistical maps and what makes them interesting.

4.3 Preliminaries

In the practice that preceded this activity, you used ggplot2 and leaflet to create a proportional symbol map, a mapping technique used in spatial statistics to visualize geocoded event information. You also applied a simple technique called kernel analysis to the map to explore the distribution of events in the case of the cholera outbreak of Soho in London in 1854. Geocoded events are often called point patterns, so with the cholera data you were working with a point pattern.

In this activity, we will map another type of spatial data, called areal data. Areas are often administrative or political jurisdictions.

It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently on the workspace.
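
To see how the two functions work together without touching your own objects, here is a small sketch that uses a separate, hypothetical environment called scratch (the names a and b are also made up for the illustration):

```r
# A scratch environment, so the demonstration does not touch your own objects
scratch <- new.env()
assign("a", 1, envir = scratch)
assign("b", "text", envir = scratch)

# ls() returns the names of the objects as a character vector
ls(envir = scratch)

# Passing that vector to the `list` argument of rm() removes all of them
rm(list = ls(envir = scratch), envir = scratch)
length(ls(envir = scratch))
```

Without the envir argument, both functions act on the current workspace, which is what rm(list = ls()) does.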

Load the libraries you will use in this activity:

library(isdas)
library(sf)
## Warning: package 'sf' was built under R version 4.2.2
## Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE
library(tidyverse)

4.4 Creating a simple thematic map

If you successfully loaded package isdas, a dataset called HamiltonDAs should be available for analysis:

data(HamiltonDAs)

Check the class of this object:

class(HamiltonDAs)
## [1] "sf"         "data.frame"

As you can see, this is an object of class sf, which stands for simple features. Objects of this class are used in the R package sf (see here) to implement standards for spatial objects.
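
To get a feel for what an object of class sf contains, the following sketch builds a very small one by hand. The square polygon and the attribute values are made up for illustration and are unrelated to HamiltonDAs.

```r
library(sf)

# Corner coordinates of a unit square; the ring must close on its first point
ring <- rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))

# Geometry column holding a single polygon (no coordinate reference system set)
geom <- st_sfc(st_polygon(list(ring)))

# Attach attributes to the geometry to obtain a simple features table
toy_sf <- st_sf(data.frame(ID = 1, VAR1 = 0.5), geometry = geom)

class(toy_sf)
st_area(toy_sf)
```

The result behaves like a regular dataframe with one extra column that stores the geometry of each feature.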

You can examine the contents of the dataset by means of head (which will show the top rows):

head(HamiltonDAs)
## Simple feature collection with 6 features and 7 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 563306.2 ymin: 4777681 xmax: 610844.5 ymax: 4793682
## Projected CRS: NAD83 / UTM zone 17N
##     ID GTA06       VAR1      VAR2      VAR3      VAR4      VAR5                       geometry
## 1 2671  5030 0.74650172 0.2596975 0.6361925 0.2290084 0.7223464 MULTIPOLYGON (((605123.4 47...
## 2 2716  5077 0.78107142 0.4413119 0.5690740 0.8997258 0.4163702 MULTIPOLYGON (((606814 4784...
## 3 2710  5071 0.78824936 0.4632757 0.4197216 0.1619401 0.3052948 MULTIPOLYGON (((605293 4785...
## 4 2745  5108 0.82064933 0.6365193 0.9504535 0.4992477 0.6046399 MULTIPOLYGON (((607542.7 47...
## 5 2810  5177 0.09131849 0.4455965 0.3539603 0.4919869 0.6366968 MULTIPOLYGON (((564681.8 47...
## 6 2740  5103 0.22257665 0.6288826 0.1341962 0.6635202 0.4429712 MULTIPOLYGON (((574373.4 47...

Or obtain the summary statistics by means of summary:

summary(HamiltonDAs)
##        ID          GTA06          VAR1             VAR2             VAR3             VAR4             VAR5       
##  2299   :  1   4050   :  1   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  2300   :  1   4051   :  1   1st Qu.:0.3680   1st Qu.:0.3800   1st Qu.:0.3521   1st Qu.:0.2989   1st Qu.:0.2998  
##  2301   :  1   4052   :  1   Median :0.5345   Median :0.4937   Median :0.5699   Median :0.5476   Median :0.4810  
##  2302   :  1   4053   :  1   Mean   :0.5241   Mean   :0.4966   Mean   :0.5548   Mean   :0.5325   Mean   :0.5001  
##  2303   :  1   4054   :  1   3rd Qu.:0.6938   3rd Qu.:0.6091   3rd Qu.:0.7378   3rd Qu.:0.7894   3rd Qu.:0.6915  
##  2304   :  1   4055   :  1   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  (Other):291   (Other):291                                                                                       
##           geometry  
##  MULTIPOLYGON :297  
##  epsg:26917   :  0  
##  +proj=utm ...:  0  
##                     
##                     
##                     
## 

The above will include a column for the geometry of the spatial features.

The dataframe includes all Dissemination Areas (or DAs for short) for the Hamilton Census Metropolitan Area in Canada. DAs are a type of geography used by the Census of Canada, in fact the smallest geography that is publicly available.

To create a simple map we can use ggplot2, which previously we used to map points. Now, the geom for objects of class sf can be used to plot areas. To create such a map, we layer a geom object of type sf on a ggplot2 object. For instance, to plot the DAs:

ggplot(HamiltonDAs) + 
  geom_sf(fill = "gray", color = "black", alpha = .3, size = .3)

We selected a gray fill for the polygons and color “black” for their boundary lines, with a transparency alpha = 0.3 (alpha = 0 is completely transparent, alpha = 1 is completely opaque, try it!), and line size 0.3.

This map only shows the DAs, which is nice. However, as you saw in the summary of the dataframe above, in addition to the geometric information, a set of (generic) variables is also included, called VAR1, VAR2,…, VAR5.

Thematic maps can be created using these variables. The next chunk of code plots the DAs and adds information about one of the variables. The fill aesthetic is used to select a variable to color the polygons. The function cut_number is used to classify the values of the variable in \(k\) groups of equal size, in this case 5 (notice that the lines of the polygons are still black). The scale_fill_brewer function can be used to select different palettes or coloring schemes:

ggplot(HamiltonDAs) +
  geom_sf(aes(fill = cut_number(VAR1, 5)), color = "black", alpha = 1, size = .3) +
  scale_fill_brewer(palette = "Reds") +
  coord_sf() +
  labs(fill = "Variable")
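
To see what cut_number does on its own, here is a minimal sketch that classifies a made-up vector (the integers from 1 to 100, chosen only for illustration) into five groups:

```r
library(ggplot2)

# Illustrative values only: the integers 1 to 100
x <- 1:100

# Classify the values into 5 groups with (approximately) equal counts
groups <- cut_number(x, 5)

# Each class covers a different range of values but holds the same number of cases
table(groups)
```

This is a quantile classification: the breaks are chosen so that each class contains about the same number of observations, which is why each colour in the thematic map is used for roughly the same number of areas.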

Now that you have seen how to create a thematic map with polygons (areal data), you are ready for the following activity.

4.5 Activity

NOTE: Activities include two types of questions. The first are technical “how to” tasks/questions, which usually ask you to practice using the software to organize data, create plots, and so on in support of analysis and interpretation. The second type asks you to activate your brainware and to think geographically and statistically.

Activity Part I

  1. Create thematic maps for variables VAR1 through VAR5 in the dataframe HamiltonDAs. Remember that you can introduce new chunks of code.

Activity Part II

  1. Imagine that the maps you produced in the first question were found, and for some reason the variables were not labeled. They may represent income, or population density, or something else. Which of the five maps you just created is the most interesting? Rank the five maps from most to least interesting. Explain the reasons for your ranking.

5 Mapping in R: Continued

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

In the preceding chapters, you were introduced to the following concepts:

  1. Basic operations in R. These include arithmetic and logical operations, among others.
  2. Data classes in R. Data can be numeric, characters, logical values, etc.
  3. Data types in R. Ways to store data, for instance as vectors, matrices, dataframes, etc.
  4. Indexing. Ways to retrieve information from a data frame by referring to its location therein.
  5. Creating simple maps in R.

Please review the previous practices if you need a refresher on these concepts.

5.1 Learning Objectives

In this reading, you will learn:

  1. How to quickly summarize the descriptive statistics of a dataframe.
  2. More about factors.

Factors are a class of data that is used for categorical data. For instance, a parcel may be categorized as developed or undeveloped; a plot of land may be zoned for commercial, residential, or industrial use; a sample may be mineral x or y. These are not quantities but rather reflect a quality of the entity that is being described.

  3. How to subset a dataset.

Sometimes you want to work with only a subset of a dataset. This can be done using indexing with logical values, or using specialized functions.

  4. More on the use of pipe operators.

A pipe operator allows you to pass the results of a function to another function. It makes writing instructions more intuitive and simple. You have already seen pipe operators earlier: they look like this %>%.

  5. How to add layers to a ggplot object to improve a map.

5.2 Suggested Readings

  • Bivand RS, Pebesma E, Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapters 2-3. Springer: New York.
  • Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 3. Sage: Los Angeles.
  • O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapters 1-3. John Wiley & Sons: New Jersey.

5.3 Preliminaries

As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently on the workspace.

Load the libraries you will use in this activity:

library(tidyverse)
library(isdas)

Now that your workspace is clear, you can proceed to invoke the sample dataset. You can do this by means of the function data.

data("missing_df")

The dataframe missing_df includes \(n = 65\) observations (Note: text between $ characters is mathematical notation in LaTeX). These observations are geocoded using a false origin and coordinates normalized to the unit-square (the extent of their values is between zero and one). The coordinates are x and y.

In addition, there are three variables associated with the locations (VAR1, VAR2, VAR3). The variables are generic. Feel free to think of them as housing prices, concentrations in ppb of some contaminant, or any other variable that will help clarify your understanding. Finally, a factor variable (Observed) states whether the variables were measured for a location: if the status is “FALSE”, the values of the variables are missing.
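
The text does not say how the coordinates in missing_df were normalized. As an illustration only, one common way to bring coordinates to the unit square is min-max rescaling, sketched below with made-up projected coordinates and a hypothetical helper function called rescale01:

```r
# Hypothetical projected coordinates in metres (made-up values)
easting  <- c(563300, 574400, 605100, 610800)
northing <- c(4777700, 4780200, 4793600, 4785000)

# Min-max rescaling: shift to a false origin at the minimum, divide by the range
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))

x <- rescale01(easting)
y <- rescale01(northing)
range(x)
range(y)
```

Note that rescaling each axis separately, as done here, stretches the shorter axis, so distances are no longer comparable between x and y. Dividing both axes by the same range would preserve the shape of the pattern.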

5.4 Summarizing a Dataframe

Obtaining a set of descriptive statistics for a dataframe is very simple thanks to the function summary. For instance, the summary of missing_df is:

# `summary()` reports basic descriptive statistics of columns in a data frame
summary(missing_df)
##        x                 y                VAR1             VAR2             VAR3         Observed 
##  Min.   :0.01699   Min.   :0.01004   Min.   :  50.0   Min.   :  50.0   Min.   :  50.0   FALSE: 5  
##  1st Qu.:0.22899   1st Qu.:0.19650   1st Qu.: 453.3   1st Qu.: 570.1   1st Qu.: 630.3   TRUE :60  
##  Median :0.41808   Median :0.50822   Median : 459.1   Median : 574.4   Median : 640.0             
##  Mean   :0.49295   Mean   :0.46645   Mean   : 458.8   Mean   : 562.1   Mean   : 638.1             
##  3rd Qu.:0.78580   3rd Qu.:0.74981   3rd Qu.: 465.4   3rd Qu.: 594.2   3rd Qu.: 646.0             
##  Max.   :0.95719   Max.   :0.98715   Max.   :1050.0   Max.   :1050.0   Max.   :1050.0             
##                                      NA's   :5        NA's   :5        NA's   :5

This function reports the minimum, first quartile, median, mean, third quartile, and maximum of each numeric variable, together with the number of missing values (NA's), if any. When variables are factors, the frequency of each level is reported. For instance, in missing_df, there are five instances of FALSE and sixty instances of TRUE.
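
The same behaviour can be reproduced on a much smaller scale. The sketch below uses a made-up dataframe, toy, with one numeric column and one factor column:

```r
# Toy dataframe (made-up values) with a numeric column containing a missing
# value and a factor column
toy <- data.frame(
  VAR1 = c(10, 20, 30, 40, NA),
  Observed = factor(c("TRUE", "TRUE", "TRUE", "TRUE", "FALSE"))
)

# Numeric column: quartiles, mean, and a count of NA values
s <- summary(toy$VAR1)
s

# Factor column: number of cases in each level
summary(toy$Observed)
```

Notice that the statistics of the numeric column are calculated after setting aside the missing value, which is reported separately.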

5.5 Factors

A factor describes a category. You can examine the class of a variable by means of the function class. From the summary, it is clear that several variables are numeric. However, for Observed, it is not evident if the variable is a character or factor. Use of class reveals that it is indeed a factor:

class(missing_df$Observed)
## [1] "factor"

Factors are an important data type because they allow us to store information that is not measured as a quantity. For example, the quality of the cut of a diamond is categorized as Fair < Good < Very Good < Premium < Ideal. Sure, we could store this information as numbers from 1 to 5. However, the quality of the cut is not a quantity, and should not be treated like one.

In the dataframe missing_df, the variable Observed could have been coded as 1’s (for missing) and 2’s (for observed), but this does not mean that “observed” is twice the amount of “missing”! In this case, the numbers would not be quantities but labels. Factors in R allow us to work directly with the labels.
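
As a brief sketch of these ideas, the diamond example can be written as an ordered factor. The three values below are made up for illustration:

```r
# Quality of cut stored as an ordered factor; the levels are listed from worst to best
cut_quality <- factor(c("Good", "Ideal", "Fair"),
                      levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"),
                      ordered = TRUE)

# The labels are what we work with
levels(cut_quality)
table(cut_quality)

# Order comparisons are meaningful for an ordered factor
cut_quality < "Premium"

# The underlying integer codes are positions in the list of levels, not quantities
as.integer(cut_quality)
```

The integer codes only record which label applies to each case; an "Ideal" cut (code 5) is not five times a "Fair" cut (code 1).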

Now, you may be wondering what it means when the status of a datum’s Observed variable is coded as FALSE. If you check the summary again, there are five cases of NA in the variables VAR1 through VAR3. NA essentially means that the value is missing. Likely, the five NA values correspond to the five missing observations. We can check this by subsetting the data.

5.6 Subsetting Data

We subset data when we wish to work only with parts of a dataset. We can do this by indexing. For example, we could retrieve the part of the dataframe that corresponds to the FALSE values in the Observed variable:

missing_df[missing_df$Observed == FALSE,]
##       x    y VAR1 VAR2 VAR3 Observed
## 61 0.34 0.83   NA   NA   NA    FALSE
## 62 0.29 0.52   NA   NA   NA    FALSE
## 63 0.13 0.32   NA   NA   NA    FALSE
## 64 0.62 0.10   NA   NA   NA    FALSE
## 65 0.88 0.85   NA   NA   NA    FALSE

Data are indexed by means of the square brackets [ and ]. The indices correspond to the rows and columns, in that order, separated by a comma. The logical statement missing_df$Observed == FALSE selects the rows that meet the condition, whereas leaving a blank for the columns simply means “all columns”.

As you can see, the five NA values correspond, as anticipated, to the locations where Observed is FALSE.
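
To make the mechanics of logical indexing more explicit, the following sketch uses a made-up dataframe, toy, and stores the condition in a hypothetical variable called is_missing before using it as an index:

```r
# Toy dataframe (made-up values)
toy <- data.frame(
  x = c(0.1, 0.4, 0.7, 0.9),
  VAR1 = c(12, NA, 15, NA),
  Observed = factor(c("TRUE", "FALSE", "TRUE", "FALSE"))
)

# The condition yields one logical value per row
is_missing <- toy$Observed == FALSE
is_missing

# Rows where the condition is TRUE are kept; the blank after the comma keeps all columns
toy[is_missing, ]

# Naming columns after the comma restricts the result to those columns
toy[is_missing, c("x", "Observed")]
```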

Using indices is only one of many ways of subsetting data. Base R also has a subset command, which is used as follows:

subset(missing_df, Observed == FALSE)
##       x    y VAR1 VAR2 VAR3 Observed
## 61 0.34 0.83   NA   NA   NA    FALSE
## 62 0.29 0.52   NA   NA   NA    FALSE
## 63 0.13 0.32   NA   NA   NA    FALSE
## 64 0.62 0.10   NA   NA   NA    FALSE
## 65 0.88 0.85   NA   NA   NA    FALSE

And the package dplyr (part of the tidyverse) has a function called filter:

filter(missing_df, Observed == FALSE)
##      x    y VAR1 VAR2 VAR3 Observed
## 1 0.34 0.83   NA   NA   NA    FALSE
## 2 0.29 0.52   NA   NA   NA    FALSE
## 3 0.13 0.32   NA   NA   NA    FALSE
## 4 0.62 0.10   NA   NA   NA    FALSE
## 5 0.88 0.85   NA   NA   NA    FALSE

The three approaches give the same result (apart from the row names, which filter resets), but subset and filter are somewhat easier to write. You could nest any of the above approaches as part of another function. For instance, if you wanted to do a summary of the selected subset of the data, you would do the following:

summary(filter(missing_df, Observed == FALSE))
##        x               y              VAR1          VAR2          VAR3      Observed
##  Min.   :0.130   Min.   :0.100   Min.   : NA   Min.   : NA   Min.   : NA   FALSE:5  
##  1st Qu.:0.290   1st Qu.:0.320   1st Qu.: NA   1st Qu.: NA   1st Qu.: NA   TRUE :0  
##  Median :0.340   Median :0.520   Median : NA   Median : NA   Median : NA            
##  Mean   :0.452   Mean   :0.524   Mean   :NaN   Mean   :NaN   Mean   :NaN            
##  3rd Qu.:0.620   3rd Qu.:0.830   3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA            
##  Max.   :0.880   Max.   :0.850   Max.   : NA   Max.   : NA   Max.   : NA            
##                                  NA's   :5     NA's   :5     NA's   :5

Or:

summary(missing_df[missing_df$Observed == FALSE,])
##        x               y              VAR1          VAR2          VAR3      Observed
##  Min.   :0.130   Min.   :0.100   Min.   : NA   Min.   : NA   Min.   : NA   FALSE:5  
##  1st Qu.:0.290   1st Qu.:0.320   1st Qu.: NA   1st Qu.: NA   1st Qu.: NA   TRUE :0  
##  Median :0.340   Median :0.520   Median : NA   Median : NA   Median : NA            
##  Mean   :0.452   Mean   :0.524   Mean   :NaN   Mean   :NaN   Mean   :NaN            
##  3rd Qu.:0.620   3rd Qu.:0.830   3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA            
##  Max.   :0.880   Max.   :0.850   Max.   : NA   Max.   : NA   Max.   : NA            
##                                  NA's   :5     NA's   :5     NA's   :5

Nesting functions makes it difficult to read the code, since functions are evaluated from the innermost to the outermost function, whereas we are used to reading from left to right. Fortunately, the package magrittr (which is required by tidyverse) provides a so-called pipe operator that simplifies things and allows for code that is more intuitive to read.

5.7 Pipe Operator

A pipe operator is written this way: %>%. Its objective is to pass forward the output of a function to a second function, so that they can be chained to create more complex instructions that are still relatively easy to read.

For instance, instead of nesting the subsetting instructions in the summary function, you could do the subsetting first, and pass the results of that to the summary for further processing. This would look like this:

# Remember, the pipe operator `%>%` passes the value of the left-hand side 
# to the function on the right-hand side
subset(missing_df, Observed == FALSE) %>% summary()
##        x               y              VAR1          VAR2          VAR3      Observed
##  Min.   :0.130   Min.   :0.100   Min.   : NA   Min.   : NA   Min.   : NA   FALSE:5  
##  1st Qu.:0.290   1st Qu.:0.320   1st Qu.: NA   1st Qu.: NA   1st Qu.: NA   TRUE :0  
##  Median :0.340   Median :0.520   Median : NA   Median : NA   Median : NA            
##  Mean   :0.452   Mean   :0.524   Mean   :NaN   Mean   :NaN   Mean   :NaN            
##  3rd Qu.:0.620   3rd Qu.:0.830   3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA            
##  Max.   :0.880   Max.   :0.850   Max.   : NA   Max.   : NA   Max.   : NA            
##                                  NA's   :5     NA's   :5     NA's   :5

The code above is read as “subset missing_df and pass the results to summary”. Pipe operators make writing and reading code somewhat more natural.
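
The equivalence between nesting and piping can be checked with a small numeric example. The vector values is made up for illustration:

```r
library(magrittr)

values <- c(1, 4, 9)

# Nested form: read from the inside out
nested <- sum(sqrt(values))

# Piped form: read from left to right
piped <- values %>% sqrt() %>% sum()

identical(nested, piped)
```

Both forms take the square roots first and then add them up; the pipe only changes the order in which the steps are written.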

5.8 More on Visualization

Observations in the sample dataset are geo-referenced, and so they can be plotted. Since they are based on false origins and are normalized, we cannot map them to the surface of the Earth. However, we can still visualize their spatial distribution. This can be done by using ggplot2. For instance, for missing_df:

# `coord_fixed()` forces the plot to use a ratio of 1:1 for the units 
# in the x- and y-axis; in this case, since the values we are mapping 
# to those axes are coordinates, we wish to represent them using the 
# same scale, i.e., one unit in x looks identical to one unit in y 
# (as an experiment, repeat the plot without fixing the coordinates) 
ggplot() + 
  geom_point(data = missing_df, 
             aes(x = x, 
                 y = y), 
             shape = 17, 
             size = 3) + 
  coord_fixed()

The above simply plots the coordinates, so that we can see the spatial distribution of the observations. (Notice the use of coord_fixed to keep the aspect ratio of the plot at 1, i.e., one unit in x is drawn with the same length as one unit in y). You have control of the shape of the markers, as well as their size. You can consult the shapes available here. Experiment with different shapes and sizes if you wish.

The dataframe missing_df includes more attributes that could be used in the plot. For instance, if you wished to create a thematic map showing VAR1 you would do the following:

ggplot() + 
  geom_point(data = missing_df,
             aes(x = x, 
                 y = y,
                 color = VAR1), 
             shape = 17, 
             size = 3) + 
  coord_fixed()

The shape and size assignments happen outside of aes, and so are applied equally to all observations. In some cases, you might want to let other aesthetic attributes vary with the values of a variable in the dataframe. For instance, if we let the sizes change with the value of the variable:

ggplot() + 
  geom_point(data = missing_df, 
             aes(x = x, 
                 y = y, 
                 color = VAR1, 
                 size = VAR1), 
             shape = 17) + 
  coord_fixed()
## Warning: Removed 5 rows containing missing values (geom_point).

Note how there is a warning, saying that five observations were removed because data were missing! These are likely the five locations where Observed == FALSE!
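
A guess like this can be verified directly. The sketch below shows one way to do so on a made-up dataframe, toy, that mimics the structure of missing_df; the same expressions could be applied to missing_df itself:

```r
# Toy dataframe (made-up values) mimicking the structure of missing_df
toy <- data.frame(
  VAR1 = c(450, NA, 460, NA, 455),
  Observed = factor(c("TRUE", "FALSE", "TRUE", "FALSE", "TRUE"))
)

# is.na() flags missing values; summing the flags counts them
sum(is.na(toy$VAR1))

# Cross-tabulate missingness against the Observed factor
table(missing = is.na(toy$VAR1), observed = toy$Observed)

# TRUE if the missing values are exactly the rows where Observed is "FALSE"
all(is.na(toy$VAR1) == (toy$Observed == "FALSE"))
```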

To make it clearer which observations these are, you could set the shape to vary according to the value of Observed, as follows:

ggplot() + 
  geom_point(data = missing_df,
             aes(x = x, 
                 y = y, 
                 color = VAR1, 
                 shape = Observed), 
             size = 3) +
  coord_fixed()

Now it is easy to see the locations of the five observations with Observed == FALSE, which are shown as gray circles.
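
The difference between setting an attribute outside of aes and mapping it to a variable inside of aes, as done in the plots above, can be inspected with the ggplot2 function layer_data. This is a minimal sketch on a made-up dataframe, toy:

```r
library(ggplot2)

# Toy data (made-up values)
toy <- data.frame(x = c(0.2, 0.5, 0.8),
                  y = c(0.3, 0.6, 0.9),
                  VAR1 = c(10, 20, 30))

p <- ggplot() +
  geom_point(data = toy,
             aes(x = x, y = y, color = VAR1),  # mapped: varies with the data
             shape = 17, size = 3) +           # set: the same for every point
  coord_fixed()

# layer_data() returns the values ggplot2 computed for each point in a layer
built <- layer_data(p, 1)
built[, c("x", "y", "colour", "shape", "size")]
```

In the resulting table, shape and size are constant, while colour takes a different value for each row because it follows VAR1.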

You can change the coloring scheme by means of scale_color_distiller (you can check the different color palettes available here):

ggplot() + 
  geom_point(data = missing_df, 
             aes(x = x,
                 y = y, 
                 color = VAR1,
                 shape = Observed),
             size = 3) +
  scale_color_distiller(palette = "RdBu") +
  coord_fixed()

You may notice that with this coloring scheme some observations become very light and difficult to distinguish from the background. This can be solved in many different ways (for instance, by changing the color of the background!). A simple fix is to add a layer with hollow symbols, as follows:

ggplot() + 
  geom_point(data = missing_df, 
             aes(x = x, 
                 y = y, 
                 color = VAR1), 
             shape = 17, 
             size = 3) +
  geom_point(data = missing_df, 
             aes(x = x, 
                 y = y), 
             shape = 2, 
             size = 3) +
  scale_color_distiller(palette = "RdBu") +
  coord_fixed()

Finally, you could try subsetting the data to have greater control of the appearance of your plot, for instance:

ggplot() +
  geom_point(data = subset(missing_df, 
                           Observed == TRUE),
             aes(x = x, 
                 y = y, 
                 color = VAR1), 
             shape = 17, 
             size = 3) +
  geom_point(data = subset(missing_df, 
                           Observed == TRUE),
             aes(x = x, 
                 y = y), 
             shape = 2, 
             size = 3) +
  geom_point(data = subset(missing_df, 
                           Observed == FALSE),
             aes(x = x, 
                 y = y), 
             shape = 18, 
             size = 4) +
  scale_color_distiller(palette = "RdBu") +
  coord_fixed()

These are examples of creating and improving the appearance of simple symbol maps, which are often used to represent observations in space.

6 Activity 2: Statistical Maps II

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

6.1 Housekeeping Questions

Answer the following questions:

  1. How many examinations are there in this course?
  2. What is the date of the first examination?
  3. Where is the office of your instructor?

6.2 Learning objectives

In this activity you will:

  1. Learn about patterns and processes, including random patterns.
  2. Understand the general approach to retrieve a process from a pattern.
  3. Discuss the importance of discriminating random patterns.

6.3 Suggested reading

O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapters 1-3. John Wiley & Sons: New Jersey.

6.4 Preliminaries

It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently on the workspace.

Load the libraries you will use in this activity:

library(tidyverse)
library(isdas)

Now that your workspace is clear, you can proceed to invoke the datasets required for this activity:

data("missing_df")
data("PointPattern1")
data("PointPattern2")
data("PointPattern3")

The datasets include the following dataframe which will be used in the first part of the activity:

  • missing_df

This dataframe includes \(n = 65\) observations (Note: text between $ characters is mathematical notation in LaTeX). These observations are geocoded using a false origin and coordinates normalized to the unit-square (the extent of their values is between zero and one). The coordinates are x and y. In addition, there are three variables associated with the locations (VAR1, VAR2, VAR3). The variables are generic. Feel free to think of them as if they were housing prices or concentrations in ppb of some contaminant. Finally, a factor variable states whether the variables were measured for a location: if the status is “FALSE”, the values of the variables are missing.

The following dataframes will be used in the second part of the activity:

  • PointPattern1
  • PointPattern2
  • PointPattern3

The dataframes PointPattern* are locations of some generic events. The coordinates x and y are also based on a false origin and are normalized to the unit-square. Feel free to think of these events as cases of flu, the location of trees of a certain species, or the location of fires.

6.5 Activity

NOTE: Activities include technical “how to” tasks/questions. Usually, these ask you to practice using the software to organize data, create plots, and so on in support of analysis and interpretation. The second type of question asks you to activate your brainware and to think geographically and statistically.

Activity Part I

  1. Create thematic maps for variables VAR1 through VAR3 in the dataframe missing_df.

  2. Plot all three point patterns.

Activity Part II

  1. Suppose that you were tasked with estimating the value of a variable for the locations where it was not measured. For instance, you could be a realtor, and you need to assess the value of a property, and the only information available is the published values of other properties in the region. As an alternative, you could be an environmental scientist, and you need to estimate the concentration of a contaminant at a site, based on previous measurements at other sites in the region. Propose one or more ways to guess those missing values, and explain your reasoning. The approach does not need to be the same for all variables!

  2. Imagine that you are a public health official and you need to plan services to the public. If you were asked to guess where the next event would emerge, where would be your guess in each map? Explain your answer.

7 Maps as Processes: Null Landscapes, Spatial Processes, and Statistical Maps

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

In the last practice your learning objectives were:

  1. How to obtain a descriptive summary of a dataframe.
  2. Factors and how to use them.
  3. How to subset a dataframe.
  4. Pipe operators and how to use them.
  5. How to improve your maps.

Please review the previous practices if you need a refresher on these concepts.

7.1 Learning Objectives

In this chapter, you will learn:

  1. How to generate random numbers with different properties.
  2. About Null Landscapes.
  3. About stochastic processes.
  4. How to create new columns in a dataframe using a formula.
  5. How to simulate a spatial process.

7.2 Suggested Readings

  • Bivand RS, Pebesma E, Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Analysing Spatial Data (pp. 169-171). Springer: New York.
  • O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 4. John Wiley & Sons: New Jersey.

7.3 Preliminaries

As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently on the workspace.

Load the libraries you will use in this activity:

library(tidyverse)

7.4 Random Numbers

Colloquially, we understand random as something that happens in an unpredictable manner. The same word in statistics has a precise meaning, as the outcome of a process that cannot be predicted with any form of certainty.

The question whether random processes exist is philosophically interesting. In the early stages of the invention of science, there was much optimism that humans could one day understand every aspect of the universe. This notion is well illustrated by Laplace’s Demon, a hypothetical entity that could predict the state of the universe in the future based on an all-encompassing knowledge of the state of the universe at any past point in time (see here).

There are two important limitations to this perspective. First, there is the assumption that the mechanisms of operation of phenomena are well understood (in the case of Laplace’s Demon, it was somewhat naively assumed that classical Newtonian mechanics were sufficient). And secondly, the assumption that all relevant information is available to the observer.

There are many processes in reality that are not fully understood, which makes Laplace’s Demon an interesting but unreliable guide for predicting the state of the universe. Furthermore, there are often constraints in terms of how much and how accurately information can be collected with respect to any given phenomenon.

Types of Processes

A process can be deterministic. However, when limited knowledge or limited information prevents us from making certain predictions, we assume that the process is random.

It is important to note that “random” does not mean that just any outcome is possible. For instance, if you flip a coin, there are only two possible outcomes. If you roll a die, there are only six possible outcomes. The concentration of a pollutant cannot be negative. The height of a human adult cannot be zero or 10 meters. And so on. What is random is which of the possible outcomes is realized, as there is no process controlling the respective outcome.

Over time, many formulas have been devised to describe different types of random processes. A random probability distribution function describes the probability of observing different outcomes.

For instance, a formula for processes similar to coin flips was discovered by Bernoulli in 1713 (see here).

The following function reports a random binomial variable. The number of observations n is how many random numbers we require. The size is the number of trials. For instance, if the experiment was flipping a coin, it would be how many times we get heads in size flips. The probability of success prob is the probability of getting heads in any given toss. Execute the chunk repeatedly to see what happens.

#This function simulates the outcome of flipping a coin. 
# Here, we are simulating the result for flipping heads, 
# which has a probability of 0.5. The value of `n` is the 
# number of experiments and `size` is the number of trials 
# in each experiment 
rbinom(n = 1, 
       size = 1, 
       prob = 0.5)
## [1] 1

It can be noted that although there are only two outcomes, we do not have control over the result of the process, making the result random. If you tried this “experiment” repeatedly, you would find that “heads” (1s) and “tails” (0s) each appear about 50% of the time. A way to implement this is to increase n: think of this as recruiting more people to do coin flips at the same time:

n <- 1000 # Number of people tossing the coin one time.
coin_flips <- rbinom(n = n, 
                     size = 1, 
                     prob = 0.5)
sum(coin_flips)/n
## [1] 0.485

What happens if you change the size to 0, and why?

The binomial distribution is an example of a discrete probability distribution, because the variable can take only one of a discrete (limited) number of values (with size = 1 as above, 0 and 1).
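To see what “discrete” means when there are more than two possible values, the following minimal sketch counts the number of heads in three flips per experiment and compares the simulated shares with the theoretical probabilities returned by dbinom(). The seed, the number of experiments, and the object names are assumptions chosen only for illustration:

```r
# Assumed seed, used only so that the sketch is reproducible
set.seed(123)

# 10,000 experiments, each one consisting of 3 coin flips
heads <- rbinom(n = 10000, 
                size = 3, 
                prob = 0.5)

# Observed share of experiments with 0, 1, 2, or 3 heads
observed <- as.numeric(table(factor(heads, levels = 0:3))) / length(heads)

# Theoretical probabilities of 0, 1, 2, or 3 heads
theoretical <- dbinom(0:3, 
                      size = 3, 
                      prob = 0.5)

# Compare side by side
data.frame(heads = 0:3, 
           observed = round(observed, 3), 
           theoretical)
```

Only four values are possible (0 to 3 heads), and the simulated shares should be close to the theoretical probabilities of 0.125, 0.375, 0.375, and 0.125.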

Other random probability distribution functions are for continuous variables, variables that can take any value within a predefined range. The most famous of these distributions is the normal distribution, which you may also know as the bell curve. This probability distribution is attributed to Gauss (see here).

The normal distribution is defined by a centering parameter (the mean of the distribution) and a spread parameter (the standard deviation). In the normal distribution, 68% of values are within one standard deviation from the mean, 95% of values are within two standard deviations from the mean, and 99.7% of values are within three standard deviations from the mean.
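These percentages can be checked with pnorm(), the cumulative distribution function of the normal distribution in base R. This is a small sketch; the object names k and coverage are chosen here for illustration:

```r
# Number of standard deviations from the mean
k <- 1:3

# Probability of a value falling within k standard deviations 
# of the mean in a standard normal distribution
coverage <- pnorm(k) - pnorm(-k)

round(coverage, 4)
# approximately 0.6827 0.9545 0.9973
```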

The following function reports a value taken at random from a normal distribution with mean zero and standard deviation sd of one. Execute this chunk repeatedly to see what happens:

# This function generates random numbers based on 
# the normal distribution conditional on the given 
# arguments, i.e., the mean and the standard deviation `sd`. 
rnorm(1, 
      mean = 0, 
      sd = 1)
## [1] -0.09784744

Let’s say that the average height of men in Canada is 170.7 cm and the standard deviation is 7 cm. The height of a random person in this population would be:

rnorm(1, 
      mean = 170.7, 
      sd = 7)
## [1] 175.246

And the distribution of heights of n men in this population would be:

#Creating a data frame using the random numbers generated 
# from n=1000 people. The results in the data frame are then 
# plotted using ggplot. The end result is a distribution of 
# heights of 1000 men. You are able to see which heights are 
# most common out of the sample.
n <- 1000
height <- rnorm(n, 
                mean = 170.7, 
                sd = 7)
height <- data.frame(height)

# `geom_histogram()` is a geometric object in `ggplot2` that 
# represents the frequency of values in a vector as a bar chart
ggplot(data = height, 
       aes(x = height)) + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Men shorter than 150 cm would be extremely rare, as well as men taller than 190 cm.
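How rare? Under the same assumed distribution (mean of 170.7 cm and standard deviation of 7 cm), the tail probabilities can be computed with pnorm(). This is a sketch, and the object names are illustrative:

```r
# Probability of a height below 150 cm
p_short <- pnorm(150, 
                 mean = 170.7, 
                 sd = 7)

# Probability of a height above 190 cm (upper tail)
p_tall <- pnorm(190, 
                mean = 170.7, 
                sd = 7, 
                lower.tail = FALSE)

c(p_short = p_short, p_tall = p_tall)
```

Both probabilities are well under 1%, so in a sample of 1000 men we would expect only a handful of such heights.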

7.5 Null Landscapes

So what do random variables have to do with maps?

Random variables can be used to generate purely random maps. These are called null landscapes or neutral landscapes in spatial ecology (With and King 1997) (Paper is available to download).

The concept of null landscapes is quite useful. They provide a benchmark to compare the results of statistical maps. Let us see how to generate a null landscape of events.

Suppose that there is a landscape with coordinates in the unit square, that is divided in very small discrete units of land. Each of these units of land can be the location of an event. For example, a tree might be present; or a case of a disease.

Let’s first create a landscape. For this, we will use the expand.grid function to find all combinations of two sets of coordinates in the unit interval, using small partitions:

# `expand.grid()` creates a set of coordinates by obtaining 
# all the combinations of the input variables. Here, our 
# landscape ranges in the x-axis from 0 to 1, increasing 
# by 0.05, and the y-axis also from 0 to 1, increasing by 0.05
coords <- expand.grid(x = seq(from = 0, 
                              to = 1, 
                              by = 0.05),
                      y = seq(from = 0, 
                              to = 1, 
                              by = 0.05))
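To get a feel for what expand.grid does, it may help to try it first on a much smaller, hypothetical grid (the object small_grid is not used elsewhere in the chapter) and then to count the rows of the full landscape:

```r
# A tiny grid: 3 values of x combined with 2 values of y 
# gives 3 * 2 = 6 combinations
small_grid <- expand.grid(x = c(0, 0.5, 1), 
                          y = c(0, 1))
small_grid

# The landscape used in the text: 21 values of x combined 
# with 21 values of y gives 21 * 21 = 441 locations
coords <- expand.grid(x = seq(from = 0, 
                              to = 1, 
                              by = 0.05),
                      y = seq(from = 0, 
                              to = 1, 
                              by = 0.05))
nrow(coords)
```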

Now, let’s generate a binomial random variable to go with these coordinates.

# `nrow()` returns the number of rows that are present 
# in a data frame. Here, it returns the number of rows 
# in the data frame `coords` 
events <- rbinom(n = nrow(coords), 
                 size = 1, 
                 prob = 0.5)

We will collect the coordinates and the random variable in a dataframe for plotting:

# `data.frame()` collects the inputs in a data frame;
# they must have the same number of rows
null_pattern <- data.frame(coords, 
                           events)

We can plot the null landscape we just generated as follows:

ggplot() + 
  geom_point(data = filter(null_pattern, 
                           events == 1), 
             aes(x = x, 
                 y = y), 
             shape = 15) +
  coord_fixed()

By changing the probability prob in the function rbinom you can make the event more or less likely, i.e., frequent. If you are working with the notebook version of this document you can try changing the parameters to see what happens.
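As a rough illustration of the effect of prob, the sketch below simulates three null landscapes of the same size and reports the share of locations with an event in each. The seed, the probabilities tried, and the object names are assumptions made for this example:

```r
# Assumed seed, used only so that the sketch is reproducible
set.seed(42)

# Same number of locations as the landscape in the text
n_cells <- 21 * 21

# Probabilities to try
probs <- c(0.1, 0.5, 0.9)

# For each probability, simulate a null landscape and calculate 
# the share of locations where an event is present
share_events <- sapply(probs, function(p) {
  mean(rbinom(n = n_cells, 
              size = 1, 
              prob = p))
})

data.frame(prob = probs, 
           share_events = round(share_events, 2))
```

The share of locations with events should be close to prob in each case, so the map becomes sparser or denser, but it remains random.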

A continuous random variable can also be used to generate a null landscape. For instance, imagine that a group of individuals are asked to stand in formation, and that they arrange themselves purely at random. What would a map of their heights look like? First, we will generate a random variable using the same parameters we mentioned above for the height of men in Canada:

# `heights` will be random numbers generated based on 
# the average height of men (170.7 cm), a standard deviation of 7 cm, 
# and the number of rows in `coords`, the coordinates created previously.
heights <- rnorm(n = nrow(coords), 
                 mean = 170.7, 
                 sd = 7)

The random values that were generated can be collected in a dataframe with the coordinates for the purpose of plotting:

null_trend <- data.frame(coords, 
                         heights)

One possible map of heights when the individuals stand in formation at random would look like this:

# Our plot is created based on the dataframe of coords and heights. 
# The value of `x` is plotted to the x-axis, the value of `y` is plotted 
# to the y-axis, and the color of the points depends on the values of 
# `heights`. We can change the _scale_ of colors by means of 
# `scale_color_distiller()`. There, palette `Spectral` shows higher 
# values of `heights` (taller men) in red, while lower values of `heights` 
# (i.e., shorter men) appear in blue. More generally, we can control 
# the scale of aesthetic aspects of the plot by means of scale_*something* 
# (scale_shape, scale_size, etc.) 
ggplot() + 
  geom_point(data = null_trend, 
             aes(x = x, 
                 y = y, 
                 color = heights), 
             shape = 15) +
  scale_color_distiller(palette = "Spectral") +
  coord_fixed()

These two examples illustrate only two of many possible techniques to generate null landscapes. We will discuss other strategies to work with null landscapes later in the course.

7.6 Stochastic Processes

Some processes are random, such as the ones used above to create null landscapes. These processes take values with some probability, but cannot be predicted with any certainty.

We will illustrate this, using again a unit square:

# Remember that `expand.grid()` will find all combinations of values in the inputs
coords <- expand.grid(x = seq(from = 0, 
                              to = 1, 
                              by = 0.05), 
                      y = seq(from = 0, 
                              to = 1, 
                              by = 0.05))

Here is an example of a random pattern of events:

# Create a random variable and join to the coordinates to generate a null landscape
events <- rbinom(n = nrow(coords), 
                 size = 1, 
                 prob = 0.5)
null_pattern <- data.frame(coords, 
                           events)

# Plot the null landscape you just created
ggplot() + 
  geom_point(data = subset(null_pattern, 
                           events == 1), 
             aes(x = x, 
                 y = y), 
             shape = 15) +
  coord_fixed()

A systematic or deterministic process is one that contains no elements of randomness, and can therefore be predicted with complete certainty. For instance (note the use of xlim to set the extent of x axis in the plot):

# Copy the coordinates to a new object
deterministic_point_pattern <- coords

# `mutate()` adds new variables to a data frame while preserving 
# existing variables. Here, we create a new column in our data frame, 
# called `events` that will take the value of `x` (the position of an 
# observation along the x-axis) and will `round()` it, i.e., if it is 
# less than or equal to 0.5 it will round to zero (R rounds a half to 
# the nearest even number), and if it is greater than 0.5 it will round to 1
deterministic_point_pattern <- mutate(deterministic_point_pattern, 
                                      events = round(x))

# Plot the new landscape: `filter()` keeps the rows in a dataframe 
# that meet a condition (for example, that the value of `events` is 1),
# and discards the rest
ggplot() + 
  geom_point(data = filter(deterministic_point_pattern, 
                           events == 1), 
             aes(x = x, 
                 y = y), 
             shape = 15) +
  xlim(0, 1) +
  coord_fixed()

In the process above, we used the function round() and the coordinate x. The function gives a value of one for all points with x > 0.5, and a value of zero to all points with x <= 0.5. The pattern is fully deterministic: if I know the value of the x coordinate I can predict whether an event will be present.
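The behavior at exactly 0.5 may be surprising: R rounds a half to the nearest even number, which is why a point at x = 0.5 is assigned zero rather than one. A quick check (the values in x_check are chosen here for illustration):

```r
# Values just below, exactly at, and just above 0.5
x_check <- c(0.45, 0.5, 0.55)

rounded <- round(x_check)
rounded
# 0 0 1
```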

A stochastic process, on the other hand, is a process that is neither fully random nor fully deterministic, but rather a combination of the two. Let’s illustrate:

# Copy the coordinates to a new object 
stochastic_point_pattern <- coords

# Here, we combine the function `round()`, which is a deterministic operation, 
# and `rbinom()` to generate a random number
stochastic_point_pattern <- 
  mutate(stochastic_point_pattern, 
         events = round(x) - round(x) * rbinom(n = nrow(coords), 
                                               size = 1, 
                                               prob = 0.5))

# Plot the new landscape
ggplot() + 
  geom_point(data = subset(stochastic_point_pattern,
                           events == 1), 
             aes(x = x, 
                 y = y), 
             shape = 15) +
  xlim(0, 1) +
  coord_fixed()

The process above has a deterministic component (the probability of an event is zero if x <= 0.5), and a random component (the probability of a coordinate being an event is 0.5 when x > 0.5). The landscape is not fully random, but also it is not fully deterministic. Instead, it is the result of a stochastic process, a process that combines deterministic and random elements.
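The two components can be checked numerically. The sketch below recreates the process with base R only (instead of mutate) and calculates the share of locations with events on each side of x = 0.5. The seed and the object names are assumptions made for this example:

```r
# Assumed seed, used only so that the sketch is reproducible
set.seed(1)

# Same landscape as in the text
coords <- expand.grid(x = seq(from = 0, 
                              to = 1, 
                              by = 0.05), 
                      y = seq(from = 0, 
                              to = 1, 
                              by = 0.05))

# Deterministic component `round(x)` combined with a random 
# binomial component, as in the text
events <- round(coords$x) - round(coords$x) * rbinom(n = nrow(coords), 
                                                     size = 1, 
                                                     prob = 0.5)

# Share of locations with an event, west (FALSE) and 
# east (TRUE) of x = 0.5
share_by_side <- tapply(events, coords$x > 0.5, mean)
share_by_side
```

The share is exactly zero where x <= 0.5 (the deterministic part), and should be near 0.5 where x > 0.5 (the random part).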

7.7 Simulating Spatial Processes

Null landscapes are interesting as a benchmark. More interesting are landscapes that emerge as the outcome of a non-random process - either a systematic/deterministic or stochastic process. Here we will see more ways to introduce a systematic element into a null landscape to simulate spatial processes.

Let’s begin with the point pattern, using the same landscape that we used above. We will first copy the coordinates of the landscape to a new dataframe, that we will call pattern1:

# Copy the coordinates to a new object, called `pattern1`
pattern1 <- coords

Next, we will use the function mutate from the dplyr package that is part of the tidyverse. This function adds a column to a data frame that could be calculated using a formula. For instance, we will now make the probability prob of the random binomial number generator a function of the coordinates:

# Remember, mutate adds a new column to a data frame. In this 
# example, mutate creates a new column, `events` using random
# binomial values; however, notice that the `prob` is not 0.5! 
# Instead, it depends on `x` the position of the event on the x-axis 
pattern1 <- mutate(pattern1, 
                   events = rbinom(n = nrow(pattern1), 
                                   size = 1, 
                                   prob = (x)))

Plot this pattern:

ggplot() + 
  geom_point(data = subset(pattern1, 
                           events == 1), 
             aes(x = x, 
                 y = y), 
             shape = 15) +
  coord_fixed()

Since the probability of a “success” in the binomial experiment is proportional to the value of x (the coordinate of the event), now the events are clustered to the right of the plot. The underlying process in this case can be described in simple terms as “the probability of an event increases in the east direction”. In a real process, this could possibly be the result of wind conditions, soil fertility, or other environmental factors that follow a trend.
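The trend can also be summarized without a map. This minimal sketch uses base R only, with an assumed seed and illustrative object names, to calculate the share of locations with an event in each column of the landscape:

```r
# Assumed seed, used only so that the sketch is reproducible
set.seed(2021)

# Same landscape as in the text
coords <- expand.grid(x = seq(from = 0, 
                              to = 1, 
                              by = 0.05), 
                      y = seq(from = 0, 
                              to = 1, 
                              by = 0.05))

# The probability of an event is the value of the x coordinate
events <- rbinom(n = nrow(coords), 
                 size = 1, 
                 prob = coords$x)

# Share of locations with an event for each value of `x`
x_values <- sort(unique(coords$x))
share_by_x <- as.numeric(tapply(events, coords$x, mean))

data.frame(x = x_values, 
           share = round(share_by_x, 2))
```

The share is zero in the first column (x = 0), one in the last column (x = 1), and tends to increase in between, although not smoothly, since each column has only 21 locations.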

Let us see what happens when we make this probability a function of the y coordinate:

# Overwrite the `events`, now the probability of success 
# in the random binomial number generator is a function of 
# `y`, the position of the event on the y-axis 
pattern1 <- mutate(pattern1, 
                   events = rbinom(n = nrow(pattern1), 
                                   size = 1, 
                                   prob = (y)))

# Plot the new events
ggplot() + 
  geom_point(data = subset(pattern1, 
                           events == 1), 
             aes(x = x, 
                 y = y), 
             shape = 15) +
  coord_fixed()

Since the probability of a “success” in the binomial experiment is proportional to the value of y (the coordinate of the event), now the events are clustered to the top. The probability could be the interaction of the two coordinates:

# Now the probability is the product of `x` and `y`
pattern1 <- mutate(pattern1, 
                   events = rbinom(n = nrow(pattern1), 
                                   size = 1, 
                                   prob = (x * y)))

# Plot
ggplot() + 
  geom_point(data = subset(pattern1, 
                           events == 1), 
             aes(x = x, 
                 y = y), 
             shape = 15) +
  coord_fixed()

This of course means that the events cluster in the top-right corner.

A somewhat more sophisticated example could make the probability a function of distance from the center of the region:

# Copy the coordinates to the object `pattern1`
pattern1 <- coords

# In this case, `mutate()` creates a new variable, `distance`, 
# which is the straight line distance from the center of the 
# region (at coordinates x = 0.5 and y = 0.5). Now the probability 
# of success in the random binomial number generator depends on this `distance` 
pattern1 <- mutate(pattern1, 
                   distance = sqrt((0.5 - x)^2 + (0.5 - y)^2), 
                   events = rbinom(n = nrow(pattern1), 
                                   size = 1, 
                                   prob = 1 - exp(-0.5 * distance)))

Do not worry too much about the formula that I selected to generate this process; we will see different tools to describe a spatial process. In this particular example, I selected a function that makes the probability increase with distance from the center of the region.
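To see how this function behaves, we can evaluate it at a few distances: the center of the region, two intermediate distances, and a corner of the unit square, which is the farthest a point can be from the center. The distances chosen are illustrative:

```r
# Distances from the center; sqrt(0.5) is the distance from 
# the center (0.5, 0.5) to any corner of the unit square
distance <- c(0, 0.25, 0.5, sqrt(0.5))

# Probability of an event at each distance
prob <- 1 - exp(-0.5 * distance)

data.frame(distance = round(distance, 3), 
           prob = round(prob, 3))
```

The probability is zero at the center and increases with distance, but it stays below 0.3 even at the corners, so events are relatively sparse everywhere in this landscape.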

Plot this pattern:

ggplot() + 
  geom_point(data = subset(pattern1, 
                           events == 1), 
             aes(x = x, 
                 y = y), 
             shape = 15) +
  coord_fixed()

As you would expect, there are few events near the center, and the number of events tends to increase away from the center.

To conclude this practice, let’s revisit the example of the people standing in formation. Now, taller people are asked to stand towards the back of the formation (assuming that the back is in the positive direction of the y-axis). As a result of this instruction, now the sorting is not random, since taller people tend to stand towards the back. However, people are not able to assess the height of each other exactly, so there will be some random variation in the distribution of heights. We can simulate this by making the height a function of position.

First, we copy the coordinates to a new dataframe for our trend experiment:

trend1 <- coords

Again we use mutate to add a column to a data frame that is calculated using a formula. This time, we make the height a function of the y coordinate (the position in the formation), plus a normally distributed random component:

trend1 <- mutate(trend1,
                 heights = 160 + 20 * y  + rnorm(n = nrow(trend1), 
                                                 mean = 0, 
                                                 sd = 7))

If people have a preference for standing next to people about their same height, and shorter people have a preference for standing near the front, this is a possible map of heights in the formation:

ggplot() + 
  geom_point(data = trend1, aes(x = x, 
                                y = y, 
                                color = heights), 
             shape = 15) +
  scale_color_distiller(palette = "Spectral") +
  coord_fixed()

As expected, shorter people are towards the “front” (bottom of the plot) and taller people towards the back. It is not a uniform process, since there is still some randomness, but a trend can be clearly appreciated.
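One way to see the trend through the noise is to average the heights in each row of the formation. The sketch below recreates the process with base R only; the seed and the object names are assumptions made for this example:

```r
# Assumed seed, used only so that the sketch is reproducible
set.seed(7)

# Same landscape as in the text
coords <- expand.grid(x = seq(from = 0, 
                              to = 1, 
                              by = 0.05), 
                      y = seq(from = 0, 
                              to = 1, 
                              by = 0.05))

# Height is a function of `y` plus a random component
heights <- 160 + 20 * coords$y + rnorm(n = nrow(coords), 
                                       mean = 0, 
                                       sd = 7)

# Mean height in each row of the formation (each value of `y`)
mean_by_row <- as.numeric(tapply(heights, coords$y, mean))

# Front row, middle row, and back row
round(mean_by_row[c(1, 11, 21)], 1)
```

The row means should be close to 160 cm at the front, 170 cm in the middle, and 180 cm at the back, even though individual heights vary considerably around those values.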

7.8 Processes and Patterns

O’Sullivan and Unwin (2010) make an important distinction between processes and patterns. A process is like a recipe, a sequence of events or steps, that leads to an outcome, that is, a pattern.

You can think of the simulation procedures above as having two components: the process is the formula, function, or algorithm used to simulate a pattern. For instance, a random process could be based on the binomial distribution, whereas a stochastic process would have in addition to a random component some deterministic elements. The pattern is the outcome of the process. In the case of spatial processes, the outcome is typically a statistical map.

The procedures in the preceding sections illustrate just a few different ways to simulate spatial processes with the aim of generating statistical maps that display spatial patterns. There are in fact many more ways to simulate spatial processes, and articles (e.g., Geyer and Møller 1994) - and even books (e.g., Møller and Waagepetersen 2003) - have been written on this topic! Simulation is a very valuable tool in spatial statistics, as we shall see in later chapters.

It is important to note, however, that in the vast majority of cases we do not actually know the process; that is precisely what we wish to infer. Understanding process generation in a statistical sense, as well as null landscapes, is a useful tool that can help us to infer processes in applications with empirical (as opposed to simulated) data. In this sense, spatial statistics is often a tool used to make decisions about spatial patterns: are they random? And, if they are not random, can we infer the underlying process?

8 Activity 3: Maps as Processes

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

8.1 Practice Questions

Answer the following questions:

  1. What is a Geographic Information System?
  2. What distinguishes a statistical map from other types of mapping techniques?
  3. What is a null landscape?

8.2 Learning Objectives

In this activity, you will:

  1. Simulate landscapes using various types of processes.
  2. Discuss the difference between random and non-random landscapes.
  3. Think about ways to decide whether a landscape is random.

8.3 Suggested Reading

O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 4. John Wiley & Sons: New Jersey.

8.4 Preliminaries

It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently on the workspace.

Load the libraries you will use in this activity:

library(tidyverse)

In the practice that preceded this activity, you learned how to simulate null landscapes and spatial processes.

8.5 Activity

NOTE: Activities include technical “how to” tasks/questions. Usually, these ask you to practice using the software to organize data, create plots, and so on in support of analysis and interpretation. The second type of question asks you to activate your brainware and to think geographically and statistically.

Activity Part I

  1. (*)Simulate and plot a landscape using a random, stochastic, or deterministic process. It is your choice whether to simulate a point pattern or a continuous variable. Identify the key parameters that make a landscape more or less random. Repeat several times changing those parameters.

Activity Part II

  1. Recreate any one of the maps you created and share the map with a fellow student. Ask them to guess whether the map is random or non-random.

  2. Repeat step 1 several times (between two and four times).

  3. Propose one or more ways to decide whether a landscape is random, and explain your reasoning. The approach does not need to be the same for point patterns and continuous variables!

Part III: Analysis of Point Patterns

9 Point Pattern Analysis I

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

In the last practice your learning objectives were:

  1. How to generate random numbers with different properties.
  2. About Null Landscapes.
  3. How to create new columns in a dataframe using a formula.
  4. How to simulate a spatial process.

Please review the previous practices if you need a refresher on these concepts.

9.1 Learning Objectives

In this practice, you will learn:

  1. A formal definition of point pattern.
  2. Processes and point patterns.
  3. The concepts of intensity and density.
  4. The concept of quadrats and how to create density maps.
  5. More ways to control the look of your plots, in particular faceting and adding lines.

9.2 Suggested Readings

  • Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapter 3. Longman: Essex.
  • Baddeley A, Rubak E, Turner R (2016) Spatial Point Patterns: Methodology and Applications with R, Chapter 1, 1.1 - 1.2. CRC: Boca Raton.
  • Bivand RS, Pebesma E, Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 7. Springer: New York.
  • Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 6, 6.1 - 6.6. Sage: Los Angeles.
  • O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 5. John Wiley & Sons: New Jersey.

9.3 Preliminaries

As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently on the workspace.

Load the libraries you will use in this activity:

library(isdas)
library(tidyverse)
library(spatstat)

Load the data that you will use for this practice:

data("PointPatterns")

Quickly check the contents of this dataframe:

summary(PointPatterns)
##        x                y                 Pattern  
##  Min.   :0.0169   Min.   :0.005306   Pattern 1:60  
##  1st Qu.:0.2731   1st Qu.:0.289020   Pattern 2:60  
##  Median :0.4854   Median :0.550000   Pattern 3:60  
##  Mean   :0.5074   Mean   :0.538733   Pattern 4:60  
##  3rd Qu.:0.7616   3rd Qu.:0.797850                 
##  Max.   :0.9990   Max.   :0.999808

The dataframe contains the x and y coordinates of four different patterns of points, each with \(n=60\) events.

9.4 Point Patterns

Previously you created different types of maps and learned about different kinds of processes (i.e., random, stochastic, deterministic). A map that you have seen on several occasions is one where the coordinates of an event of interest are available. The simplest kind of data of this type is called a point pattern. This occurs when only the coordinates are available.

A point pattern is given by a set of events of interest that are observed in a region \(R\).

A region has an infinite number of points, essentially coordinates \((x_i, y_i)\) on the plane. The number of points is infinite, because there is a point defined by, say, coordinates (1,1), and also a point for coordinates (1.1,1), and for coordinates (1.01,1), and so on. Any location that can be described by a set of coordinates contained in the region is a point.

Not all points are events, however. An event is defined as a point where something of interest happened. This could be the location where a tree exists, or a crime happened, the epicenter of an earthquake, a case of a disease was reported, and so on. There might be one such occurrence, or more. Each event is denoted by: \[ \textbf{s}_i \] with coordinates: \[ (x_i,y_i). \] Sometimes other attributes of the events have been measured as well. For example, the event could be an address where cholera was reported (as in John Snow’s famous map). In addition to the address (which can be converted into the coordinates of the event), the number of cases could be recorded. Other examples could be the height and diameter of trees, the magnitude of an earthquake, etc.

It is important, for reasons that will be discussed later, that the point pattern is a complete enumeration. What this means is that every event that happened has been recorded! Interpretation of most analyses becomes dubious if the events are only sampled, that is, if only a few of them have been observed and recorded.

9.5 Processes and Point Patterns

Point patterns are interesting in many applications. In these applications, a key question of interest is whether the pattern is random.

Imagine a point pattern that records crimes in a region. The pattern might be random, in which case there is no way to anticipate where the next occurrence of criminal activity will be. Non-random patterns, on the other hand, are likely the outcome of some meaningful process. For instance, crimes might cluster as a consequence of some common environmental variable (e.g., concentration of wealth). Conversely, they might repel each other (e.g., the location of a crime draws the attention of law enforcement, and therefore the next occurrence of a crime tends to happen away from it). Deciding whether the pattern is random or not is the initial step towards developing hypotheses about the underlying process.
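
To get a feel for these three situations, the sketch below simulates one pattern of each kind on the unit square using only base R. This is an illustration under simple assumptions, not the way the patterns in the dataset were generated: the seed, the cluster center, the spread of the cluster, and the grid spacing are all arbitrary choices.

```r
set.seed(42)  # arbitrary seed, only so that the sketch is reproducible
n <- 60

# Random: x and y are drawn independently and uniformly on the unit square
random_pp <- data.frame(x = runif(n), y = runif(n))

# Clustered: events scattered tightly around a single (hypothetical) center,
# clipped so that they stay inside the unit square
clustered_pp <- data.frame(
  x = pmin(pmax(rnorm(n, mean = 0.3, sd = 0.05), 0), 1),
  y = pmin(pmax(rnorm(n, mean = 0.7, sd = 0.05), 0), 1)
)

# Dispersed: events placed on a regular 10-by-6 grid, as if they repelled
# each other
dispersed_pp <- expand.grid(x = seq(0.05, 0.95, length.out = 10),
                            y = seq(0.1, 0.9, length.out = 6))

# Compare the spread of the x coordinates in the three patterns
sapply(list(random = random_pp,
            clustered = clustered_pp,
            dispersed = dispersed_pp),
       function(pp) sd(pp$x))
```

Each of these data frames has columns `x` and `y`, so it can be plotted with `geom_point()` in the same way as the patterns shown next.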

Consider for example the following patterns. To create the following figure, you can use faceting by means of ggplot2::facet_wrap():

# This uses function "ggplot" to plot data "PointPatterns" loaded 
# into the data frame earlier, by means of X and Y coordinates

# the function `facet_wrap()` is used to create multiple plots 
# according to one (or more) variables in the dataset. Here, it 
# is used to create individual plots for each of the four patterns 
# in the dataframe, but put them all in a single figure
ggplot() + 
  geom_point(data = PointPatterns, 
             aes(x = x, 
                 y = y)) + 
  facet_wrap(~ Pattern) +
  coord_fixed()

As you can see, faceting is a convenient way to simultaneously plot different parts of a dataframe (in the present case, the different levels of Pattern).

In the preceding activity, you were asked to generate ideas regarding possible ways of deciding whether a map of events (i.e., a point pattern) is random. In this chapter we will formalize a specific way to do so, by considering the intensity of the process.

9.6 Intensity and Density

The intensity of a spatial point process is the expected number of events per unit area. This is conventionally denoted by the Greek letter \(\lambda\).

In most cases the process is not known, so its intensity cannot be directly measured. In its place, the density of the point pattern is taken as the empirical estimate of the intensity of the underlying process. The density of the point pattern is calculated very simply as the number of events divided by the area of the region, that is: \[ \hat{\lambda} = \frac{\#(S \in R)}{a} = \frac{n}{a}, \] where \(\#(S \in R)\) is the number of events \(n\) in region \(R\) and \(a\) is the area of the region. Notice the use of the “hat” symbol on top of the Greek lambda. This symbol is called “caret”. The hat notation is used to indicate an estimated value of an unobserved parameter of a process as opposed to the true (but usually unknown) value. In the present case this is the intensity of the spatial point process.

Consider one of the point patterns in your sample dataset, say “Pattern 1”. If we filter for “Pattern 1” we can then summarize it:

filter(PointPatterns, Pattern == "Pattern 1") %>% summary()
##        x                y                 Pattern  
##  Min.   :0.0285   Min.   :0.005306   Pattern 1:60  
##  1st Qu.:0.3344   1st Qu.:0.236509   Pattern 2: 0  
##  Median :0.5247   Median :0.500262   Pattern 3: 0  
##  Mean   :0.5531   Mean   :0.500248   Pattern 4: 0  
##  3rd Qu.:0.8417   3rd Qu.:0.761218                 
##  Max.   :0.9888   Max.   :0.999808

We see that there are \(n = 60\) points in this dataset. Since the region is the unit square (check how the values of the coordinates range from approximately zero to approximately 1), the area of the region is 1. This means that for “Pattern 1”: \[ \hat{\lambda} = \frac{60}{1} = 60 \]

This is the overall density of the point pattern.
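
The same calculation can be sketched in a few lines of R. The variable names below are hypothetical; the number of events and the extent of the region are the ones discussed above:

```r
# Overall density: number of events divided by the area of the region
n <- 60              # number of events in the pattern
x_range <- c(0, 1)   # extent of the region along the x-axis
y_range <- c(0, 1)   # extent of the region along the y-axis

# Area of a rectangular region
a <- diff(x_range) * diff(y_range)

# Estimated intensity (lambda-hat)
lambda_hat <- n / a
lambda_hat
```

Because the area of the unit square is 1, the density is numerically equal to the number of events; this would not be the case for a region of any other size.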

9.7 Quadrats and Density Maps

The overall density of a point process (calculated above) can be mapped by means of the geom_bin2d function of the ggplot2 package. This function divides two dimensional space into bins and reports the number of events or the density of the events in the bins. We will give this a try next:

# `geom_bin2d()` creates a tessellation and counts the number of events
# in each of the "tiles" in the tessellation. It then assigns colors based 
# on the count of events. The `binwidth` determines the size of the squares 
# in the tessellation, in this case squares of size 1 by 1...which corresponds 
# to the size of the region!
ggplot() +
  geom_bin2d(data = subset(PointPatterns, 
                           Pattern == "Pattern 1"),
             aes(x = x, 
                 y = y),
             binwidth = c(1, 1)) +
  coord_fixed()

Let us see step-by-step how this plot is made.

  1. ggplot() creates a plot object.
  2. geom_bin2d is called to plot a map of counts of events in the space defined by the bins.
  3. The dataframe used for plotting the bins is PointPatterns, subset so that only the points in “Pattern 1” are used.
  4. The coordinates x and y are used to plot (in aes(), we indicate that x in the dataframe corresponds to the x axis in the plot, and y in the dataframe corresponds to y axis in the plot)
  5. The size of the bin is defined as 1-by-1 (binwidth = c(1, 1))
  6. coord_fixed is applied to ensure that the aspect ratio of the plot is one (one unit of x is the same length as one unit of y in the plot).

The map of the overall density of the process above is not terribly interesting. It only reports what we already knew, that globally the density of the point pattern is 60. It would be more interesting to see how the density varies across the region.

We do this by means of the concept of quadrats.

Imagine that instead of calculating the overall (or global) intensity of the point pattern, we subdivided the region into a set of smaller subregions. For instance, we could draw horizontal and vertical lines to create smaller squares:

# `geom_vline()` draws vertical lines that cross the x-axis at 
# the points indicated; `geom_hline()` draws horizontal lines 
# that cross the y-axis at the points indicated
ggplot() + 
  geom_vline(xintercept = seq(from = 0, 
                              to = 1, 
                              by = 0.25)) +
  geom_hline(yintercept = seq(from = 0, 
                              to = 1, 
                              by = 0.25)) +
  geom_point(data = filter(PointPatterns, 
                           Pattern == "Pattern 1"), 
             aes(x = x, 
                 y = y)) +
  coord_fixed()

Notice how we used seq() to place the vertical lines (geom_vline) and the horizontal lines (geom_hline) from 0 to 1, every 0.25 units of distance. This creates a tessellation that divides the original region into 16 smaller squares, or subregions. Each of the smaller squares used to subdivide the region is called a quadrat.

To make things more interesting, instead of calculating the overall density, we can calculate the density for each quadrat. Now the size of the quadrats will be \(0.25\times 0.25\). Here we visualize the density of the quadrats:

ggplot() +
  geom_bin2d(data = filter(PointPatterns, 
                           Pattern == "Pattern 1"),
             aes(x = x, 
                 y = y),
             binwidth = c(0.25, 
                          0.25)) +
    geom_point(data = filter(PointPatterns, 
                             Pattern == "Pattern 1"), 
             aes(x = x, 
                 y = y)) +
  scale_fill_distiller(palette = "RdBu") +
  coord_fixed()

You can, of course, change the size of the quadrats. We can take a look at the four point patterns (by means of faceting), after creating a variable to easily control the size of the quadrat. Let us call this variable q_size:

# `q_size` controls the size of the quadrats; experiment changing this parameter
q_size <- 0.5
ggplot() +
  geom_bin2d(data = PointPatterns,
             aes(x = x, 
                 y = y),
             binwidth = c(q_size, 
                          q_size)) +
  geom_point(data = PointPatterns, 
             aes(x = x, 
                 y = y)) +
  facet_wrap(~ Pattern) +
  scale_fill_distiller(palette = "RdBu") +
  coord_fixed()

Notice the differences in the density maps? Try changing the size of the quadrat to 1. What happens, and why? Next, try a smaller quadrat size, say 0.25. What happens, and why? Try even smaller quadrat sizes, but greater than zero. What happens now?

The package spatstat (Baddeley, Rubak, and Turner 2016) includes numerous functions for the analysis of point patterns. A relevant function for us at this stage is quadratcount(), which returns the number of events per quadrat.

To use this function, we need to convert the point patterns to a type of object used by spatstat called ppp (for planar point pattern). This is simple, thanks to a utility function in spatstat called as.ppp. This function takes as arguments (inputs) a set of coordinates, and data to define a window. To benefit from the functionality of spatstat we will convert our data frame with spatial patterns into ppp objects.

First, define the window by means of the owin function, and using the 0 to 1 interval for our region:

# `owin()` creates a window for `ppp` objects, which becomes 
# the _region_ under study. Here, we define a window that is 
# the unit square; we will discuss the importance of an 
# appropriate definition of the region later. The windows in 
# `spatstat` need not be squares or rectangles, and can actually 
# be irregular shapes
Wnd <- owin(c(0,1), c(0,1)) 

Now, a ppp object can be created:

# `as.ppp()` will take an object and convert it to a `ppp` object.
# Here, it does a fairly good job of guessing the contents of the 
# data frame! The second argument to create the `ppp` object is a 
# window, that is, an `owin` object
ppp1 <- as.ppp(PointPatterns, 
               Wnd)

If you examine this new ppp object, you will see that it packs the same basic information (i.e., the coordinates), but also the range of the region and so on:

summary(ppp1)
## Marked planar point pattern:  240 points
## Average intensity 240 points per square unit
## 
## Coordinates are given to 8 decimal places
## 
## Multitype:
##           frequency proportion intensity
## Pattern 1        60       0.25        60
## Pattern 2        60       0.25        60
## Pattern 3        60       0.25        60
## Pattern 4        60       0.25        60
## 
## Window: rectangle = [0, 1] x [0, 1] units
## Window area = 1 square unit

As you can see, the ppp object includes the four patterns, calculates the frequency of each (the number of events), and their respective overall intensities.

Objects of the class ppp can be plotted using base R plotting functions:

plot(ppp1)

To plot each pattern separately we can split the different patterns using the function split.ppp(). Notice how $ works for indexing the patterns here, just as it does for indexing columns in a data frame:

plot(split.ppp(ppp1)$`Pattern 1`)

Once the patterns are in ppp form, quadratcount can be used to compute the counts of events. To calculate the count separately for each pattern, you need to use split.ppp() again, called here through the generic split() (if you don’t index a pattern, the function is applied to all of them). The other two arguments are the number of quadrats in the horizontal (nx) and the vertical (ny) directions:

quadratcount(split(ppp1),
             nx = 4,
             ny = 4)
## List of spatial objects
## 
## Pattern 1:
##             x
## y            [0,0.25) [0.25,0.5) [0.5,0.75) [0.75,1]
##   [0.75,1]          3          5          1        6
##   [0.5,0.75)        2          3          4        6
##   [0.25,0.5)        5          4          2        3
##   [0,0.25)          2          4          4        6
## 
## Pattern 2:
##             x
## y            [0,0.25) [0.25,0.5) [0.5,0.75) [0.75,1]
##   [0.75,1]         14          2          2        6
##   [0.5,0.75)        0          0          4        6
##   [0.25,0.5)        6          3          1        2
##   [0,0.25)          4          6          2        2
## 
## Pattern 3:
##             x
## y            [0,0.25) [0.25,0.5) [0.5,0.75) [0.75,1]
##   [0.75,1]          2         11          5        7
##   [0.5,0.75)        1          1          6        4
##   [0.25,0.5)        1         10          3        2
##   [0,0.25)          2          1          2        2
## 
## Pattern 4:
##             x
## y            [0,0.25) [0.25,0.5) [0.5,0.75) [0.75,1]
##   [0.75,1]          4          5          6        3
##   [0.5,0.75)        3          3          4        2
##   [0.25,0.5)        3          3          4        2
##   [0,0.25)          5          4          6        3

Compare the counts of the quadrats for each pattern. They should replicate what you observed in the density plots before.
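
To see what counting by quadrat involves, here is a minimal base R sketch that bins events by hand. It is not how spatstat implements `quadratcount()`, and it uses simulated coordinates (an arbitrary seed and uniform random draws) rather than the patterns in the dataset:

```r
set.seed(123)  # arbitrary seed, only so that the sketch is reproducible
n <- 60
events <- data.frame(x = runif(n), y = runif(n))  # simulated events

# Limits of the quadrats: four intervals of width 0.25 along each axis
breaks <- seq(from = 0, to = 1, by = 0.25)

# Assign each event to an interval along x and along y; intervals are closed
# on the left, and the last one also includes the upper limit of the region
qx <- cut(events$x, breaks = breaks, right = FALSE, include.lowest = TRUE)
qy <- cut(events$y, breaks = breaks, right = FALSE, include.lowest = TRUE)

# Cross-tabulate to obtain the number of events in each of the 16 quadrats
counts <- table(y = qy, x = qx)

# Density of each quadrat: count divided by the area of the quadrat
quadrat_density <- counts / 0.25^2

# Show the counts with the top row of quadrats first, as on a map
counts[rev(levels(qy)), ]
```

The counts add up to the number of events, and the average of the quadrat densities is the overall density of the pattern.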

9.8 Defining the Region for Analysis

It is important when conducting the type of analysis described above (and more generally any analysis with point patterns), to define a region for analysis that is consistent with the pattern of interest.

Consider for instance what would happen if the region was defined, instead of in the unit square, as a bigger region. Create a second window:

# This new window measures 3 units in the x-axis, 
# and also 3 units in the y-axis (from -1 to 2)
Wnd2 <- owin(c(-1,2), 
             c(-1,2)) 

Create a second ppp object using this new window:

# Here, we use the same events as before, but place them in the larger window we just created
ppp2 <- as.ppp(PointPatterns,
               Wnd2)

Repeat the plot but using the new ppp object:

plot(split.ppp(ppp2)$`Pattern 1`)

Repeat but now using an even bigger region. Create a third window:

Wnd3 <- owin(c(-2, 3), 
             c(-2, 3)) 

And also a third ppp object using the third window:

ppp3 <- as.ppp(PointPatterns, 
               Wnd3)

Now the plot looks like this:

plot(split.ppp(ppp3)$`Pattern 1`)
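
The three windows contain exactly the same events and differ only in their area. The sketch below, in base R and with hypothetical object names, shows how the area and the overall density of “Pattern 1” change with each definition of the region:

```r
n <- 60  # number of events in "Pattern 1"

# Ranges of x and y for the three windows defined above
windows <- list(
  Wnd  = list(x = c(0, 1),  y = c(0, 1)),
  Wnd2 = list(x = c(-1, 2), y = c(-1, 2)),
  Wnd3 = list(x = c(-2, 3), y = c(-2, 3))
)

# Area of each (rectangular) window
area <- sapply(windows, function(w) diff(w$x) * diff(w$y))

# Overall density of the same 60 events under each window
overall_density <- n / area

data.frame(area = area, density = overall_density)
```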

Which of the three regions that you saw above is more appropriate? What do you think is the effect of selecting an inappropriate region for the analysis?

This concludes this chapter. The next activity will illustrate how quadrats are a useful tool to explore the question whether a map is random.

10 Activity 4: Point Pattern Analysis I

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

10.1 Practice questions

Answer the following questions:

  1. What is a random process?
  2. What is a deterministic process?
  3. What is a stochastic process?
  4. What is a pattern?
  5. What is the usefulness of a null landscape?

10.2 Learning objectives

In this activity, you will:

  1. Use the concept of quadrats to analyze a real dataset.
  2. Learn about a quadrat-based test for randomness in point patterns.
  3. Learn how to use the p-value of a statistical test to make a decision.
  4. Think about the distribution of events in a null landscape.
  5. Think about ways to decide whether a landscape is random.

10.3 Suggested reading

O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 5. John Wiley & Sons: New Jersey.

10.4 Preliminaries

It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently on the workspace.

Load the libraries you will use in this activity. In addition to tidyverse, you will need spatstat, a package designed for the analysis of point patterns (you can learn about spatstat here and here):

library(isdas)
library(maptools) # Needed to convert a `Spatial Polygons` object into an `owin` object
## Warning: package 'sp' was built under R version 4.2.2
library(sf)
library(spatstat)
library(tidyverse)

In the practice that preceded this activity, you learned about the concepts of intensity and density, about quadrats, and also how to create density maps. Begin by loading the data that you will use in this activity:

data("Fast_Food")
data("Gas_Stands")
data("Paez_Mart")

Next the geospatial files need to be read. For this example, the city boundary of Toronto is provided in two different formats, as a dataframe (which can be used to plot using ggplot2) and as a SpatialPolygons object, a format widely used in R for spatial analysis:

data("Toronto")

If you inspect your workspace, you will see that the following dataframes are there:

  • Fast_Food
  • Gas_Stands
  • Paez_Mart

These are locations of a selection of fast food restaurants, and also of gas stands in Toronto (data are from 2008). Paez Mart on the other hand is a project to cover Toronto with convenience stores. The points are the planned locations of the stores.

Also, there should be an object of class sf. This dataframe contains the city boundary of Toronto:

class(Toronto)
## [1] "sf"         "data.frame"

Try plotting the following:

ggplot() +
  geom_sf(data = Toronto, color = "black", fill = NA, alpha = 1, size = .3) +
  geom_sf(data = Paez_Mart) +
  coord_sf()

As discussed in the preceding chapter, the package spatstat offers a very rich collection of tools to do point pattern analysis. To convert the three sets of events (i.e., the fast food establishments, gas stands, and Paez Mart) into ppp objects we first must define a region or window. To do this we take the sf and convert to an owin (a window object) for use with the package spatstat (this is done via SpatialPolygons, hence as(x, "Spatial")):

# `as.owin()` will take a "foreign" object (foreign to `spatstat`) and convert it into an `owin` object. Here, there are two steps involved: first, we take the `sf` object with the boundaries of Toronto and convert it into a "Spatial" object, and then the "Spatial" object is passed on to `as.owin()`
Toronto.owin <- as(Toronto, "Spatial") %>% as.owin() # Requires `maptools` package

And then convert the dataframes to ppp objects (this necessitates that we extract the coordinates of the events by means of st_coordinates):

Fast_Food.ppp <- as.ppp(st_coordinates(Fast_Food), W = Toronto.owin)
Gas_Stands.ppp <- as.ppp(st_coordinates(Gas_Stands), W = Toronto.owin)
Paez_Mart.ppp <- as.ppp(st_coordinates(Paez_Mart), W = Toronto.owin)

These objects can now be used with the functions of the spatstat package. For instance, you can calculate the counts of events by quadrat by means of quadratcount. The inputs must be a ppp object, and the number of quadrats in the horizontal (nx) and vertical (ny) directions (notice how I use the function table to present the frequency of quadrats by number of events):

q_count <- quadratcount(Fast_Food.ppp, nx = 3, ny = 3)
table(q_count)
## q_count
##   0   6  44  48  60  64  85 144 163 
##   1   1   1   1   1   1   1   1   1

As you see from the table, there is one quadrat with zero events, one quadrat with six events, one quadrat with forty-four events, and so on.

You can also plot the results of the quadratcount() function!

plot(q_count)

A useful function in the spatstat package is quadrat.test. This function implements a statistical test that compares the empirical distribution of events by quadrats to the distribution of events as expected under the hypothesis that the underlying process is random.

This is implemented as follows:

q_test <- quadrat.test(Fast_Food.ppp, nx = 3, ny = 3)
## Warning: Some expected counts are small; chi^2 approximation may be inaccurate
q_test
## 
##  Chi-squared test of CSR using quadrat counts
## 
## data:  Fast_Food.ppp
## X2 = 213.74, df = 8, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 
## Quadrats: 9 tiles (irregular windows)

The quadrat test reports a \(p\)-value which can be used to make a decision. The \(p\)-value is the probability of obtaining a test statistic at least as extreme as the one observed if the null hypothesis were true. Informally, it measures the risk of being mistaken if you reject the null hypothesis. To make a decision, you need to know what the null hypothesis is, and your own tolerance for making a mistake. In the case above, the \(p\)-value is very, very small (less than 2.2e-16 = 0.00000000000000022). Since the null hypothesis is spatial randomness, you can reject this hypothesis, and the risk that this decision is mistaken is vanishingly small.
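
The decision rule can be sketched with base R. The statistic and degrees of freedom are those reported in the output above; the tolerance `alpha` is a hypothetical choice, and computing the two-sided \(p\)-value as twice the smaller tail probability is an assumption about how the test is reported, not something read from the documentation of spatstat:

```r
x2 <- 213.74  # test statistic reported in the output above
df <- 8       # degrees of freedom reported in the output above

# Tail probabilities of the chi-squared distribution under the null hypothesis
p_lower <- pchisq(x2, df = df)
p_upper <- pchisq(x2, df = df, lower.tail = FALSE)

# Two-sided p-value, taken here as twice the smaller tail (an assumption)
p_value <- min(1, 2 * min(p_lower, p_upper))

# Hypothetical tolerance for making a mistake
alpha <- 0.05

# Reject the null hypothesis only when the p-value is below the tolerance
decision <- if (p_value < alpha) "reject" else "fail to reject"
decision
```

A different tolerance (say 0.01) would lead to the same decision here, because the \(p\)-value is so small.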

Try plotting the results of quadrat.test:

plot(q_test)

Now that you have seen how to do some analysis using quadrats, you are ready for the next activity.

10.5 Activity

NOTE: Activities include technical “how to” tasks/questions. Usually, these ask you to practice using the software to organize data, create plots, and so on in support of analysis and interpretation. The second type of questions ask you to activate your brainware and to think geographically and statistically.

Activity Part I

  1. Use Fast_Food, Gas_Stands, Paez_Mart, and Toronto to create density maps for the three point patterns. Select a quadrat size that you think is appropriate.

  2. Use Fast_Food.ppp, Gas_Stands.ppp, and Paez_Mart.ppp, and the function quadratcount to calculate the number of events per quadrat. Remember that you need to select the number of quadrats in the horizontal and vertical directions!

  3. Use the function table() to examine the frequency of events per quadrat for each of the point patterns.

Activity Part II

  1. Show your density maps to a fellow student. Did they select the same quadrat size? If not, what was their rationale for their size?

  2. Again, use the function table() to examine the frequency of events per quadrat for each of the point patterns. What are the differences among these point patterns? What would you expect the frequency of events per quadrat to be in a null landscape?

  3. Use Fast_Food.ppp, Gas_Stands.ppp, and Paez_Mart.ppp, and the function quadrat.test to calculate the test of spatial independence for these point patterns. What is your decision in each case? Explain.

11 Point Pattern Analysis II

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

In the last practice/session your learning objectives included:

  1. A formal definition of point pattern.
  2. Processes and point patterns.
  3. The concepts of intensity and density.
  4. The concept of quadrats and how to create density maps.
  5. More ways to control the look of your plots, in particular faceting and adding lines.

Please review the previous practices if you need a refresher on these concepts.

11.1 Learning Objectives

In this practice, you will learn:

  1. The intuition behind the quadrat-based test of independence.
  2. About the limitations of quadrat-based analysis.
  3. The concept of kernel density.
  4. More ways to manipulate objects to do point pattern analysis using spatstat.

11.2 Suggested Readings

  • Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapter 3. Longman: Essex.
  • Baddeley A, Rubak E, Turner R (2016) Spatial Point Pattern: Methodology and Applications with R, Chapter 6. CRC: Boca Raton.
  • Bivand RS, Pebesma E, Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 7. Springer: New York.
  • Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 6, 6.1 - 6.6. Sage: Los Angeles.
  • O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 5. John Wiley & Sons: New Jersey.

11.3 Preliminaries

As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently on the workspace.

Load the libraries you will use in this activity:

library(isdas)
library(spatstat)
library(tidyverse)

Load the datasets that you will use for this practice:

data("PointPatterns")
data("pp0_df")

PointPatterns is a data frame with four sets of spatial events, labeled as “Pattern 1”, “Pattern 2”, “Pattern 3”, and “Pattern 4”. Each set has \(n=60\) events. You can check the class of this object by means of the function class():

class(PointPatterns)
## [1] "data.frame"

The second data frame (i.e., pp0_df) includes the coordinates x and y of two sets of spatial events, labeled as “Pattern 1” and “Pattern 2”.

The summary for PointPatterns shows that these point patterns are located in a square-unit window (check the max and min values of x and y):

summary(PointPatterns)
##        x                y                 Pattern  
##  Min.   :0.0169   Min.   :0.005306   Pattern 1:60  
##  1st Qu.:0.2731   1st Qu.:0.289020   Pattern 2:60  
##  Median :0.4854   Median :0.550000   Pattern 3:60  
##  Mean   :0.5074   Mean   :0.538733   Pattern 4:60  
##  3rd Qu.:0.7616   3rd Qu.:0.797850                 
##  Max.   :0.9990   Max.   :0.999808

The same is true for pp0_df:

summary(pp0_df)
##        x                y                 marks   
##  Min.   :0.0456   Min.   :0.03409   Pattern 1:36  
##  1st Qu.:0.2251   1st Qu.:0.22963   Pattern 2:36  
##  Median :0.4282   Median :0.43363                 
##  Mean   :0.4916   Mean   :0.47952                 
##  3rd Qu.:0.7812   3rd Qu.:0.77562                 
##  Max.   :0.9564   Max.   :0.94492

As seen in the previous practice and activity, the package spatstat employs a type of object called ppp (for planar point pattern). Fortunately, it is relatively simple to convert a data frame into a ppp object by means of as.ppp(). This function requires that you define a window for the point pattern, something we can do by means of the owin function:

# "W" will appear in your environment as a window defined on the unit square, [0, 1] x [0, 1]
W <- owin(xrange = c(0, 1),
          yrange = c(0, 1))

Then the data frames are converted using the as.ppp function:

# Converts the data frame to planar point pattern using the defined window "W"
pp0.ppp <- as.ppp(pp0_df, 
                  W = W)
PointPatterns.ppp <- as.ppp(PointPatterns, W = W)

You can verify that the new objects are indeed of ppp-class:

#"class" is an excellent tool to use when verifying the type of data object 
class(pp0.ppp)
## [1] "ppp"
class(PointPatterns.ppp)
## [1] "ppp"

11.4 A Quadrat-based Test for Spatial Independence

In the preceding activity, you used a quadrat-based spatial independence test to help you decide whether a pattern was random (the function was quadrat.test). We will now review the intuition of the test.

Let’s begin by plotting the patterns. You can use split to do plots for each pattern separately, instead of putting all of them in a single plot (this approach is not as refined as ggplot2, where we have greater control of the aspect of the plots; on the other hand, it is quick):

# `split()` separates a marked pattern into one pattern per mark 
# (here, one per level of `Pattern`); plotting the result is a quick 
# way to see every pattern in its own panel
plot(split(PointPatterns.ppp))

Recall that you can also plot individual patterns by using $ followed by the factor that identifies the desired pattern (this is a way of indexing different patterns in ppp-class objects):

# `$` retrieves a named element, just as it retrieves a column from a 
# data frame. In this case, you are retrieving "Pattern 4" from the 
# list of patterns returned by `split(PointPatterns.ppp)`
plot(split(PointPatterns.ppp)$"Pattern 4")

Now calculate the quadrat-based test of independence:

# `quadrat.test()` generates a quadrat-based test of independence, in this case, 
# for "Pattern 2" called from "PointPatterns.ppp", using 3 quadrats in the direction 
# of the x-axis and 3 quadrats in the direction of the y-axis 
q_test <- quadrat.test(split(PointPatterns.ppp)$"Pattern 2", 
                       nx = 3, 
                       ny = 3)
q_test
## 
##  Chi-squared test of CSR using quadrat counts
## 
## data:  split(PointPatterns.ppp)$"Pattern 2"
## X2 = 48, df = 8, p-value = 1.976e-07
## alternative hypothesis: two.sided
## 
## Quadrats: 3 by 3 grid of tiles

Plot the results of the quadrat test:

plot(q_test)

As seen in the preceding chapter, the expected distribution of events on quadrats under the null landscape tends to be quite even. This is because, in a null landscape, an event is equally likely to fall in any quadrat of the same size, so the expected number of events is the same for every quadrat (when the quadrats are not all the same size, the expected number is proportional to the area of the quadrat).
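
A minimal sketch of this idea in base R follows. The equal-area case corresponds to a 3-by-3 grid on the unit square; the unequal areas are hypothetical and chosen only to illustrate the proportionality:

```r
n <- 60  # number of events in the pattern

# Case 1: nine quadrats of equal area that cover the unit square
area_equal <- rep(1 / 9, times = 9)
expected_equal <- n * area_equal / sum(area_equal)
expected_equal

# Case 2: three quadrats of unequal (hypothetical) areas; the expected
# number of events is proportional to the area of each quadrat
area_unequal <- c(0.5, 0.25, 0.25)
expected_unequal <- n * area_unequal / sum(area_unequal)
expected_unequal
```

In both cases the expected counts add up to the total number of events in the pattern.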

If you check the plot of the quadrat test above, you will notice that the first number (top left corner) is the number of events in the quadrat. The second number (top right corner) is the expected number of events for a null landscape. The third number is a residual, based on the difference between the observed and expected number of events. More specifically, the residual is a Pearson residual, defined as follows: \[ r_i=\frac{O_i - E_i}{\sqrt{E_i}}, \] where \(O_i\) is the number of observed events in quadrat \(i\) and \(E_i\) is the number of expected events in quadrat \(i\). When the number of observed events is similar to the number of expected events, \(r_i\) will tend to be a small value. As their difference grows, the residual will also grow in absolute value.

The independence test is calculated from the residuals as: \[ X^2=\sum_{i=1}^{Q}r_i^2, \] where \(Q\) is the number of quadrats. In other words, the test is based on the sum of the squared Pearson residuals. The smaller this number is, the more likely that the observed pattern of events is not different from a null landscape (i.e., a random process), and the larger it is, the more likely that it is different from a null landscape. This is reflected by the \(p\)-value of the test (technically, the \(p\)-value is obtained by comparing the test statistic to the \(\chi^2\) distribution, pronounced “kai-square”).

Consider for instance the first pattern in the examples:

plot(quadrat.test(split(PointPatterns.ppp)$"Pattern 1", 
                  nx = 3, 
                  ny = 3))

You can see that the Pearson residual of the top left quadrat is indeed -0.6567673, the next to its right is -0.2704336, and so on. The value of the test statistic should be then:

# The `paste()` function joins together several arguments as characters. 
# Here, it builds a string with the value of "X2", where "X2" is the sum 
# of the squared residuals

paste("X2 = ", 
      (-0.65)^2 + (-0.26)^2 + (0.52)^2 + 
        (-0.26)^2 + (0.9)^2 + (0.52)^2 + 
        (-1)^2 + (0.13)^2 + (0.13)^2)
## [1] "X2 =  2.9423"

Which you can confirm by examining the results of the test (the small difference is due to rounding errors):

quadrat.test(split(PointPatterns.ppp)$"Pattern 1", 
             nx = 3, 
             ny = 3)
## 
##  Chi-squared test of CSR using quadrat counts
## 
## data:  split(PointPatterns.ppp)$"Pattern 1"
## X2 = 3, df = 8, p-value = 0.1313
## alternative hypothesis: two.sided
## 
## Quadrats: 3 by 3 grid of tiles
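
The whole calculation can also be sketched from quadrat counts instead of rounded residuals. The counts below are an assumption: they were inferred from the rounded residuals used in the calculation above, not read from the data. The two-sided \(p\)-value is computed as twice the smaller tail probability, which is also an assumption about how the test reports it:

```r
# Counts per quadrat, inferred from the rounded residuals (an assumption)
observed <- c(5, 6, 8, 6, 9, 8, 4, 7, 7)

n <- sum(observed)     # number of events
Q <- length(observed)  # number of quadrats

# Expected counts under the null landscape, for quadrats of equal size
expected <- rep(n / Q, times = Q)

# Pearson residuals
r <- (observed - expected) / sqrt(expected)
round(r, 2)

# Test statistic: the sum of the squared Pearson residuals
X2 <- sum(r^2)

# Two-sided p-value from the chi-squared distribution with Q - 1 degrees
# of freedom, taken as twice the smaller tail (an assumption)
df <- Q - 1
p_value <- 2 * min(pchisq(X2, df = df),
                   pchisq(X2, df = df, lower.tail = FALSE))

c(X2 = X2, df = df, p_value = p_value)
```

Working from counts avoids the rounding error of the previous calculation, which is why the statistic obtained this way agrees with the value reported by the test.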

Explore the remaining patterns. You will notice that the residuals and test statistic tend to grow as more events are concentrated in space. In this way, the test is a test of density of the quadrats: is their density similar to what would be expected from a null landscape?

11.5 Limitations of Quadrat Analysis: Size and Number of Quadrats

As hinted by the previous activity, one issue with quadrat analysis is the selection of the size of the quadrats. Changing the size of the quadrats affects the counts, and in turn the appearance of density plots and even the results of the test of independence.

For example, the results of the test for “Pattern 2” in the dataset change when the number of quadrats is modified. With a small number of quadrats:

quadrat.test(split(PointPatterns.ppp)$"Pattern 2", 
             nx = 2, 
             ny = 1)
## 
##  Chi-squared test of CSR using quadrat counts
## 
## data:  split(PointPatterns.ppp)$"Pattern 2"
## X2 = 1.6667, df = 1, p-value = 0.3934
## alternative hypothesis: two.sided
## 
## Quadrats: 2 by 1 grid of tiles

Compare to four quadrats:

quadrat.test(split(PointPatterns.ppp)$"Pattern 2", 
             nx = 2, 
             ny = 2)
## 
##  Chi-squared test of CSR using quadrat counts
## 
## data:  split(PointPatterns.ppp)$"Pattern 2"
## X2 = 6, df = 3, p-value = 0.2232
## alternative hypothesis: two.sided
## 
## Quadrats: 2 by 2 grid of tiles

And to six quadrats:

quadrat.test(split(PointPatterns.ppp)$"Pattern 2", 
             nx = 3, 
             ny = 2)
## 
##  Chi-squared test of CSR using quadrat counts
## 
## data:  split(PointPatterns.ppp)$"Pattern 2"
## X2 = 23.2, df = 5, p-value = 0.0006182
## alternative hypothesis: two.sided
## 
## Quadrats: 3 by 2 grid of tiles

Why is the statistic generally smaller when there are fewer quadrats?

A different issue emerges when the number of quadrats is large:

quadrat.test(split(PointPatterns.ppp)$"Pattern 2", 
             nx = 4, 
             ny = 4)
## Warning: Some expected counts are small; chi^2 approximation may be inaccurate
## 
##  Chi-squared test of CSR using quadrat counts
## 
## data:  split(PointPatterns.ppp)$"Pattern 2"
## X2 = 47.2, df = 15, p-value = 6.84e-05
## alternative hypothesis: two.sided
## 
## Quadrats: 4 by 4 grid of tiles

A warning now tells you that some expected counts are small: space has been divided so finely that the expected number of events per quadrat has become too small; as a consequence, the approximation to the probability distribution may be inaccurate.
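The arithmetic behind this warning can be sketched as follows. The number of events (60) is hypothetical, and the threshold of 5 expected events per quadrat is a widely used convention for the \(\chi^2\) approximation that is assumed here rather than taken from the text.

```r
# Hypothetical number of events in the pattern
n_events <- 60

# Grids of quadrats with increasing resolution
grids <- data.frame(nx = c(2, 2, 3, 4),
                    ny = c(1, 2, 2, 4))

# Number of quadrats and expected count per (equal-area) quadrat
grids$quadrats <- grids$nx * grids$ny
grids$expected_count <- n_events / grids$quadrats  # 30, 15, 10, 3.75

# Flag grids where the expected count falls below the assumed threshold
grids$small <- grids$expected_count < 5

grids
```

Under these assumptions only the 4-by-4 grid is flagged, since 60 events spread over 16 quadrats leave fewer than 5 expected events in each.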

While there are no hard rules to select the size/number of quadrats, the following rules of thumb are sometimes suggested:

  1. Each quadrat should have a minimum of two events.
  2. The size of the quadrats is selected based on the area (\(A\)) of the region and the number of events (\(n\)), so that the area of each quadrat is approximately: \[ \frac{2A}{n} \] which implies about \(Q = n/2\) quadrats.

Caution should be exercised when interpreting the results of quadrat-based analysis, because they depend on the size/number of quadrats.
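One reading of the second rule of thumb is that each quadrat covers an area of about \(2A/n\), so that on average two events fall in each quadrat, in line with the first rule. A minimal sketch under that assumption, for a hypothetical region with \(A = 1\) and \(n = 36\):

```r
# Hypothetical region: unit square with 36 events
A <- 1
n <- 36

# Assumed reading of the rule of thumb: area of each quadrat
quadrat_area <- 2 * A / n

# Implied number of quadrats and average number of events per quadrat
n_quadrats <- A / quadrat_area
events_per_quadrat <- n / n_quadrats

# Approximate number of quadrats per side for a square grid
per_side <- round(sqrt(n_quadrats))

c(quadrat_area = quadrat_area,
  n_quadrats = n_quadrats,
  events_per_quadrat = events_per_quadrat,
  per_side = per_side)
```

For this hypothetical case the rule points to roughly 18 quadrats, for instance a 4-by-4 grid; this is a starting point for experimentation and not a hard rule.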

11.6 Limitations of Quadrat Analysis: Relative Position of Events

Another issue with quadrat analysis is that it is not sensitive to the relative position of the events within the quadrats.

Consider for instance the following two patterns in pp0:

plot(split(pp0.ppp))

These two patterns look quite different. And yet, when we count the events by quadrats, the counts are the same:

plot(quadratcount(split(pp0.ppp), 
                  nx = 3, 
                  ny = 3))

This example highlights how quadrats are relatively coarse measures of density, and fail to distinguish between fairly different event distributions, in particular because quadrat analysis does not take into account the relative position of the events with respect to each other.
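The same point can be illustrated with two small synthetic patterns built in base R. The coordinates below are hypothetical (they are not the `pp0` data): one pattern places four tightly bunched events at the center of each quadrat, the other spreads 36 events on a regular grid.

```r
# Clustered pattern: four events bunched around the center of each quadrat
centers <- c(1, 3, 5) / 6
offsets <- c(-0.01, 0.01)
cl <- expand.grid(cx = centers, cy = centers, dx = offsets, dy = offsets)
clustered <- data.frame(x = cl$cx + cl$dx, y = cl$cy + cl$dy)

# Regular pattern: 36 events on an evenly spaced 6-by-6 grid
ticks <- (seq_len(6) - 0.5) / 6
regular <- expand.grid(x = ticks, y = ticks)

# Count events in a 3-by-3 grid of quadrats on the unit square
breaks <- seq(0, 1, length.out = 4)
count_quadrats <- function(p) {
  table(cut(p$x, breaks, include.lowest = TRUE),
        cut(p$y, breaks, include.lowest = TRUE))
}
counts_clustered <- count_quadrats(clustered)
counts_regular <- count_quadrats(regular)

# The quadrat counts are identical...
all(counts_clustered == counts_regular)

# ...but the distances to nearest neighbors are very different
nn_dist <- function(p) {
  d <- as.matrix(dist(p))
  diag(d) <- Inf
  apply(d, 1, min)
}
mean(nn_dist(clustered))
mean(nn_dist(regular))
```

Both patterns have four events in every quadrat, so any quadrat-based summary treats them as the same, even though the events are much closer together in the first one.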

11.7 Kernel Density

In order to better take into account the relative position of the events with respect to each other, a different technique can be devised.

Imagine that a quadrat is a kind of “window”. We use it to observe the landscape. When we count the number of events in a quadrat, we simply peek through that particular window: all events inside the “window” are simply counted, and all events outside the “window” are ignored. Then we visit another quadrat and do the same, until we have visited all quadrats.

Imagine now that we define a window that, unlike the quadrats which are fixed, can move and visit different points in space. This window also has the property that, instead of counting the events that are in the window, it gives greater weight to events that are close to the center of the window, and less weight to events that are more distant from the center of the window.

We can define such a window by selecting a function that declines with increasing distance. We will call this function a kernel. An example of a function that can work as a moving window is the following.

# Here we create a data.frame to use for plotting; it includes a single column 
# with a variable called `dist` for distance, that varies between -3 and 3; 
# the function `stat_function()` is used in `ggplot2` to draw the curve of 
# a function, which in this case is `dnorm()`, the density of the normal 
# distribution; `ylim()` sets the limits of the plot in the y-axis 
ggplot(data = data.frame(dist = c(-3, 3)), 
       aes(dist)) +
  stat_function(fun = dnorm, 
                n = 101, 
                args = list(mean = 0, 
                            sd = 1)) +
  ylim(c(0, 0.45))

As you can see, the value of the function declines with increasing distance from the center of the window (when dist == 0; note that the value never becomes zero!). Since we used the normal distribution, this is a Gaussian kernel. The shape of the Gaussian kernel depends on the standard deviation, which controls how “big” the window is, or alternatively, how quickly the function decays. We will call the standard deviation the kernel bandwidth of the function.

Since the bandwidth controls how rapidly the weight assigned to distant events decays, if the argument changes, so will the shape of the kernel function. As an experiment, change the value of the argument sd in the chunk above. You will see that as it becomes smaller, the slope of the kernel becomes steeper (and distant observations are downweighted more rapidly). On the contrary, as it becomes larger, the slope becomes less steep (and distant events are weighted almost as highly as close events).
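The effect of the bandwidth can also be seen numerically. The sketch below uses illustrative distances and bandwidths (chosen here only for the example) and reports the weight given to an event relative to the weight of an event at the center of the window.

```r
# Illustrative distances from the center of the window and bandwidths
distances <- c(0, 0.5, 1, 2)
bandwidths <- c(0.5, 1, 2)

# Weight at each distance relative to the weight at the center (dist == 0)
relative_weight <- sapply(bandwidths, function(s) {
  dnorm(distances, sd = s) / dnorm(0, sd = s)
})
dimnames(relative_weight) <- list(paste0("dist=", distances),
                                  paste0("sd=", bandwidths))

round(relative_weight, 3)
```

Reading across a row, the weight of an event at a given distance increases with the bandwidth: a small `sd` downweights distant events quickly, while a large `sd` treats them almost like nearby events.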

Kernel density estimates are usually obtained by creating a fine grid that is superimposed on the region. The kernel function then visits each point on the grid and obtains an estimate of the density by summing the weights of all events as per the kernel function.
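The procedure can be sketched in base R as follows. This is an illustration of the idea under simplifying assumptions, not the implementation used by any package: the events are simulated, the kernel is an isotropic Gaussian, and no edge correction is applied.

```r
set.seed(1)

# Hypothetical events, kept away from the edges of the unit square
events <- data.frame(x = runif(20, 0.3, 0.7),
                     y = runif(20, 0.3, 0.7))

# Kernel bandwidth (standard deviation of the Gaussian kernel)
sigma <- 0.05

# Fine grid superimposed on the region
grid <- expand.grid(x = seq(0, 1, length.out = 101),
                    y = seq(0, 1, length.out = 101))

# At a grid point, sum the kernel weights of all events
kde_at <- function(gx, gy) {
  sum(dnorm(gx - events$x, sd = sigma) * dnorm(gy - events$y, sd = sigma))
}
grid$density <- mapply(kde_at, grid$x, grid$y)

# The estimated density, added up over the region, recovers the number of events
cell_area <- 0.01^2
total <- sum(grid$density) * cell_area
total
```

Because each kernel integrates to one, the surface integrates to approximately the number of events; this is what makes the result an estimate of intensity (events per unit area).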

Kernel density is implemented in spatstat and can be used as follows.

The input is a ppp object, and optionally a sigma argument that corresponds to the bandwidth of the kernel:

# The "density" function computes estimates of kernel density. Here we are creating 
# a Kernel Density estimate using "pp0.ppp" from our data frame by means of a 
# bandwidth defined by "sigma"

kernel_density <- density(split(pp0.ppp), 
                          sigma = 0.1)
plot(kernel_density)

Compare to the distribution of events:

plot(split(pp0.ppp))

It is important to note that the gradation of colors is different in the two kernel density plots. Whereas the smallest value in the plot on the left is less than 20 and the largest is greater than 100, on the other plot the range is only between 45 and approximately 50. Thus, the intensity of the process is much higher at places in Pattern 1 than in Pattern 2.

The plots above illustrate how the map of the kernel density is better able to capture the variations in density across the region. In fact, kernel density is a smooth estimate of the underlying intensity of the process, and the degree of smoothing is controlled by the bandwidth.

12 Activity 5: Point Pattern Analysis II

NOTE: The source files for this book are available with the companion package {isdas}. The source files are in R Markdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

12.1 Practice questions

Answer the following questions:

  1. How does the quadrat-based test of independence respond to a small number of quadrats?
  2. How does the quadrat-based test of independence respond to a large number of quadrats?
  3. What are the limitations of quadrat analysis?
  4. What is a kernel function?
  5. How does the bandwidth affect a kernel function?

12.2 Learning objectives

In this activity, you will:

  1. Explore a dataset using quadrats and kernel density.
  2. Experiment with different parameters (number/size of quadrats and kernel bandwidths).
  3. Discuss the impacts of selecting different parameters.
  4. Hypothesize about the underlying spatial process based on your analysis.

12.3 Suggested reading

O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 5. John Wiley & Sons: New Jersey.

12.4 Preliminaries

It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently in the workspace.

Load the libraries you will use in this activity. In addition to tidyverse, you will need spatstat, a package designed for the analysis of point patterns (you can learn about spatstat here and here):

library(isdas)
library(spatstat)
library(tidyverse)

In the practice that preceded this activity, you learned about the concepts of intensity and density, about quadrats, and also how to create density maps. Begin by loading the data that you will use in this activity:

data("bear_df")

This dataset was sourced from the Scandinavia Bear Project, a Swedish-Norwegian collaboration that aims to study the ecology of brown bears, to provide decision makers with evidence to support bear management, and to provide information regarding bears to the public. You can learn more about this project here.

The project involves tagging bears with GPS units, so that their movements can be tracked.

The dataset includes coordinates of one bear’s movement over a period of several weeks in 2004. The dataset was originally taken from the adehabitatLT package but was somewhat simplified for this activity. Instead of full date and time information, the point pattern is marked more simply as “Day Time” and “Night Time”, to distinguish between diurnal and nocturnal activity of the bear.

Summarize the contents of this dataframe:

summary(bear_df)
##        x                y                  marks    
##  Min.   :515743   Min.   :6812138   Day Time  :502  
##  1st Qu.:518995   1st Qu.:6813396   Night Time:498  
##  Median :519526   Median :6816724                   
##  Mean   :519321   Mean   :6816474                   
##  3rd Qu.:519983   3rd Qu.:6818111                   
##  Max.   :522999   Max.   :6821440

The Min. and Max. of x and y give us an idea of the region covered by this dataset. We can use these values to approximate a window for the region (as an experiment, you could try changing these values to create regions of different sizes):

W <- owin(xrange = c(515000, 523500), yrange = c(6812000, 6822000))

Next, we can convert the dataframe into a ppp-class object suitable for analysis using the package spatstat:

bear.ppp <- as.ppp(bear_df, W = W)

You can check the contents of the ppp object by means of summary:

summary(bear.ppp)
## Marked planar point pattern:  1000 points
## Average intensity 1.176471e-05 points per square unit
## 
## Coordinates are given to 1 decimal place
## i.e. rounded to the nearest multiple of 0.1 units
## 
## Multitype:
##            frequency proportion    intensity
## Day Time         502      0.502 5.905882e-06
## Night Time       498      0.498 5.858824e-06
## 
## Window: rectangle = [515000, 523500] x [6812000, 6822000] units
##                     (8500 x 10000 units)
## Window area = 8.5e+07 square units
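The intensities in this summary are simply counts divided by the area of the window. As a quick check, the arithmetic can be reproduced with the window limits and frequencies reported above:

```r
# Limits of the window used to create the ppp object
x_range <- c(515000, 523500)
y_range <- c(6812000, 6822000)

# Area of the rectangular window
window_area <- diff(x_range) * diff(y_range)

# Number of events by mark, as reported in the summary
counts <- c("Day Time" = 502, "Night Time" = 498)

# Average intensity of the whole pattern and intensity by mark
overall_intensity <- sum(counts) / window_area
intensity_by_mark <- counts / window_area

window_area
overall_intensity
intensity_by_mark
```

Note that the intensity depends on the window: choosing a larger window for the same events lowers the estimated intensity.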

Now that you have loaded the dataframe and converted to a ppp object, you are ready for the next activity.

12.5 Activity

NOTE: Activities include two types of questions. The first type are technical “how to” tasks/questions. Usually, these ask you to practice using the software to organize data, create plots, and so on in support of analysis and interpretation. The second type of questions asks you to activate your brainware and to think geographically and statistically.

Activity Part I

  1. Analyze the point pattern for the movements of the bear using quadrat and kernel density methods. Experiment with different quadrat sizes and kernel bandwidths.

Activity Part II

  1. Explain your choice of parameters (quadrat sizes and kernel bandwidths) to a fellow student.

  2. Decide whether these patterns are random, and support your decision.

  3. Do you see differences in the activity patterns of the bear by time of day? What could explain those differences, if any?

  4. Discuss the limitations of your conclusions, and of quadrat/kernel (density-based) approaches more generally.

13 Point Pattern Analysis III

NOTE: The source files for this book are available with the companion package {isdas}. The source files are in R Markdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

In the last practice/session your learning objectives included:

  1. The intuition behind the quadrat-based test of independence.
  2. The concept of kernel density.
  3. The limitations of density-based analysis.
  4. More ways to work with ppp objects.

If you wish to work interactively with this chapter you will need the following:

  • An R markdown notebook version of this document (the source file).

  • A package called isdas.

13.1 Learning Objectives

In this practice, you will learn:

  1. About clustered and dispersed (or regular) patterns.
  2. The concept of nearest neighbors.
  3. About distance-based methods for point pattern analysis.
  4. About the G-function for the analysis of event-to-event nearest neighbor distances.

13.2 Suggested Readings

  • Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapter 3. Longman: Essex.
  • Baddeley A, Rubak E, Turner R (2016) Spatial Point Patterns: Methodology and Applications with R, Chapter 8. CRC: Boca Raton.
  • Bivand RS, Pebesma E, Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 7. Springer: New York.
  • Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 6, 6.1 - 6.6. Sage: Los Angeles.
  • O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 5. John Wiley & Sons: New Jersey.

13.3 Preliminaries

As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently on the workspace.

Load the libraries you will use in this activity:

library(isdas)
library(spatstat)
library(tidyverse)

Load the dataset that you will use for this practice:

data("pp0_df")

Examine the contents of the data frame you just loaded:

summary(pp0_df)
##        x                y                 marks   
##  Min.   :0.0456   Min.   :0.03409   Pattern 1:36  
##  1st Qu.:0.2251   1st Qu.:0.22963   Pattern 2:36  
##  Median :0.4282   Median :0.43363                 
##  Mean   :0.4916   Mean   :0.47952                 
##  3rd Qu.:0.7812   3rd Qu.:0.77562                 
##  Max.   :0.9564   Max.   :0.94492

As you can see, this data frame includes a set of coordinates for two point patterns, labeled “Pattern 1” and “Pattern 2”, each of which consists of \(n=36\) events. The range of the coordinates (between 0 and 1) suggests a window as follows:

# Remember, `owin()` is used to create a window to frame 
# a point pattern in the package `spatstat`
W <- owin(c(0,1), 
          c(0,1))

This creates an owin object that defines a region in the unit square.

Given window object W, it is possible to transform the dataframe into a ppp object:

# Remember, `as.ppp()` will take a foreign object (foreign 
# to `spatstat`) and convert it into a `ppp` object
pp0.ppp <- as.ppp(pp0_df, 
                  W = W)

If you need a refresher on how to create ppp objects, see the chapter “Point Pattern Analysis I”.

13.4 Motivation

Quadrats and kernel density are examples of density-based analysis. These techniques are useful to help you understand variations in the distribution of events at a relatively large scale but, as previously discussed, they may sometimes be less informative because they do not take into account small-scale variations in the locations of the events.

For this reason, the following two patterns, despite being very different, give identical counts per quadrat:

# The `split()` function is used to divide data in the vector 
# into groups using a categorical variable; in this case, the 
# `ppp` object includes only the coordinates and a variable that 
# identifies the coordinates as belonging to "Pattern 1" or 
# "Pattern 2". For this reason, the split is accomplished according 
# to this variable
plot(split(pp0.ppp))

# Arguments `nx` and `ny` indicate the number of quadrats on the 
# x and y directions respectively
plot(quadratcount(split(pp0.ppp), 
                  nx = 3, 
                  ny = 3))

The two patterns above have similar density. However, “Pattern 1” displays clustering, a situation characterized by events generally being in close proximity to others. “Pattern 2”, on the other hand, displays dispersion or regularity, a situation where points tend to be located at similar distances from each other.

With some fiddling of the parameters, quadrats can be coaxed to tease out the variations in density, for instance:

plot(quadratcount(split(pp0.ppp), 
                  nx = 9, 
                  ny = 9))

As a visualization technique, this gives a better sense of the variations in density. However, as noted previously, the quality of the test of independence deteriorates when there are many quadrats with small counts.

As an alternative, kernel density can be used to visualize the smoothed estimate of the density:

plot(density(split(pp0.ppp), 
             sigma = 0.075))

However, even when we can visualize the variations in density, we cannot, from the kernel estimate alone, tell if high/low values exceed those of a null landscape - in other words, we lack at the moment a way to test the hypothesis that the density is higher than what would be expected from a null landscape.

In this practice you will learn about a family of techniques that instead of measuring the density, explore patterns by means of distance distributions.

13.5 Nearest Neighbors

Let us begin by introducing the concept of a nearest neighbor.

The nearest neighbor of a location is the event that is closest to said location given some metric. This metric is usually Euclidean distance on the plane, that is, distance as measured using a straight line between the location and the event. In principle, the metric can be selected according to the characteristics of a dataset: this could be Euclidean distance, great circle distance, or network distance for events on networks (see the figure “Examples of distance metrics”).

Figure: Examples of distance metrics

In this way, the nearest neighbor of \(i\) is the event \(j\) with the shortest distance \(d\) from location \(i\): \[ \text{Event }j\text{ is the nearest neighbor of location }i\text{ if: }d_{ij}\le d_{ik} \forall k \]

Ties are relatively rare in most realistic point patterns (even in regular patterns), and may not have a big impact on the analysis.

The package spatstat includes functions to calculate Euclidean distances. Three functions are relevant:

  • pairdist(): returns the pairwise distance between all pairs of events i and j.

  • nndist(): returns a vector of distances from events to their corresponding nearest neighbors; these distances are obtained by sorting the pairwise distances, and selecting the minimum value for each event.

  • distmap(): returns a pixel image with the distance from each pixel to the nearest event; in effect this is a map of the distances between empty spaces and their corresponding nearest events.

With these functions we can calculate, for instance, the following distances:

# Function `nndist()` will calculate the distance of each 
# event to its nearest neighbor
pp0_nn1 <- nndist(split(pp0.ppp)$"Pattern 1")

The value of nndist() is a vector with \(n\) distances, where \(n\) is the number of events in the pattern. The first distance in the vector is the distance from the first event in the series to its nearest neighbor, the second is the distance from the second event in the series to its nearest neighbor, and so on.
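To make the idea concrete, distances to nearest neighbors can be obtained in base R from the matrix of pairwise distances. This is only a sketch of the concept with four hypothetical events; it does not describe how `spatstat` computes these distances internally.

```r
# Four hypothetical events
pts <- data.frame(x = c(0, 3, 0, 3),
                  y = c(0, 4, 1, 2))

# Pairwise Euclidean distances between all events
d <- as.matrix(dist(pts))

# An event is not its own neighbor: exclude the zero distances on the diagonal
diag(d) <- Inf

# For each event, the distance to its nearest neighbor and which event that is
nn_distance <- unname(apply(d, 1, min))
nn_index <- unname(apply(d, 1, which.min))

data.frame(event = seq_len(nrow(pts)), nn_index, nn_distance)
```

As with `nndist()`, the result has one distance per event, in the same order as the events.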

Let us explore the distribution of these distances by means of a histogram:

# Remember, `geom_histogram()` adds a histogram to a `ggplot2` object; 
# the `binwidth` argument defines the size of each bin for the histogram
ggplot(data = data.frame(dist = pp0_nn1), 
       aes(dist)) + 
  geom_histogram(binwidth = 0.03)

Notice how most events (20 out of 36) have a nearest neighbor at a relatively short distance (<0.05). What does this mean?

Compare to the distribution of distances in “Pattern 2” of pp0.ppp:

# Calculate the distances to nearest neighbors in the second point
# pattern, i.e., "Pattern 2"
pp0_nn2 <- nndist(split(pp0.ppp)$"Pattern 2")

# Create a histogram to explore the distribution of values of 
# distances to nearest neighbors
ggplot(data = data.frame(dist = pp0_nn2), 
       aes(dist)) + 
  geom_histogram(binwidth = 0.03)

In this case, most events (more than 30 out of 36) have a nearest neighbor at a distance of approximately 0.15. What does this mean?

The two histograms above are interesting in that they reveal, for “Pattern 1”, that most events are only a short distance away from another event (indicative of clustering), whereas for “Pattern 2” the suggestion is that almost all events have a nearest neighbor at a nearly constant distance (indicative of regularity). However, the histograms do not convey more spatial information. Another useful tool to explore the distribution of distances to nearest neighbors is a Stienen diagram. A Stienen diagram is essentially a proportional symbol plot of the events. The sizes of the symbols are proportional to the distance to their nearest neighbor. For example, for “Pattern 1” in pp0.ppp (notice the use of %mark% to add an attribute to the ppp object; the attribute is the distance to the nearest neighbor):

# The function %mark% is used to add a variable (a "mark") to a `ppp` object. In this example, the variable we are adding to "Pattern 1" is the distance from the event to its nearest neighbor, as calculated above
split(pp0.ppp)$"Pattern 1" %mark% (pp0_nn1) %>%
  plot(markscale = 1, main = "Stienen diagram")

In this diagram, the largest circle is not very large: even events that are relatively isolated are not a long distance away from their nearest neighbor. This fits the definition of clustering as a situation where events tend to be relatively close to each other.

Compare to the Stienen diagram of “Pattern 2”:

split(pp0.ppp)$"Pattern 2" %mark% (pp0_nn2) %>%
  plot(markscale = 1, main = "Stienen diagram")

Notice how all circles are very similar in size: this fits the definition of dispersion, where events are more or less equally distant from their nearest neighbors.

What would these diagrams look like for a null landscape? We can use the function runifpoint() from the spatstat package to generate a null landscape:

# `runifpoint()` is a function to generate random coordinates based on the uniform random distribution function. The argument tells the function to create n = 36 random coordinates for our null landscape; this null landscape is contained in the window `W`, same as our previous point patterns  
rand_ppp <- runifpoint(n = 36, win = W)

If we plot the Stienen diagram for this point pattern:

# Calculate the distances to nearest neighbors for the null landscape
rand_nn <- nndist(rand_ppp)

# Add the distances as calculated above to the point pattern using %mark% and plot the Stienen diagram
rand_ppp %mark% (rand_nn) %>%
  plot(markscale = 1, main = "Stienen diagram")

In a null landscape, the distribution of the size of the symbols would tend to be random!
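For readers who want to see the mechanics without `spatstat`, a similar diagram can be sketched in base R. The assumptions here are that a null landscape on the unit square can be simulated with `runif()`, and that each circle is drawn with a diameter equal to the distance to the nearest neighbor (hence a radius of half that distance), which is the usual convention for Stienen diagrams.

```r
set.seed(2021)

# Simulate a null landscape: 36 events placed uniformly at random
null_x <- runif(36)
null_y <- runif(36)

# Distance from each event to its nearest neighbor
d <- as.matrix(dist(cbind(null_x, null_y)))
diag(d) <- Inf
null_nn <- apply(d, 1, min)

# Circles centered on the events, with diameter equal to the distance
# to the nearest neighbor; `inches = FALSE` uses the units of the plot
symbols(null_x, null_y,
        circles = null_nn / 2,
        inches = FALSE,
        asp = 1,
        xlim = c(0, 1), ylim = c(0, 1),
        xlab = "x", ylab = "y",
        main = "Stienen diagram (base R sketch)")

summary(null_nn)
```

Running the chunk with different seeds gives a sense of how much the sizes of the circles vary from one realization of the null landscape to another.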

The concept of nearest neighbors is useful to define a family of techniques that are based on the distribution of distances to nearest neighbors. One such technique, the \(G\)-function, is introduced here.

13.6 \(G\)-function

As you have seen above, the distribution of distances to nearest neighbors presents distinctive characteristics for different types of patterns.

What is needed is a convenient way to summarize the distribution of distances to nearest neighbors. A way to do so is by means of a plot of the cumulative distribution function. A cumulative distribution is simply the proportion of events that have a nearest neighbor at a distance less than some value \(x\). When the value of \(x\) is very small, no events have a nearest neighbor at \(d_{ij}<x\). When \(x\) is very large all events have a nearest neighbor at \(d_{ij}<x\). The cumulative distribution thus depends on the value of \(x\).

Imagine for instance the following hypothetical distribution of distances of ten events to their nearest neighbors (the first event’s nearest neighbor is at a distance of 1, the second event’s nearest neighbor is at 2, the third’s at 0.5, and so on):

nnd <- c(1, 2, 0.5, 2.5, 1.7, 4, 3.5, 1.2, 2.3, 2.8)

When \(x = 0\), zero events have a nearest neighbor at that distance or less. Two events have nearest neighbors at distances \(d_{ij} \le 1\). Five events have a nearest neighbor at distances \(d_{ij} \le 2\). Eight events have a nearest neighbor at distances \(d_{ij} \le 3\). And all events have a nearest neighbor at distances \(d_{ij} \le 4\).
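These counts can be checked directly from the vector of distances. The following sketch counts, for each value of \(x\), the events with a nearest neighbor at that distance or less, and compares the resulting proportions with the base R function `ecdf()`:

```r
# Hypothetical distances of ten events to their nearest neighbors
nnd <- c(1, 2, 0.5, 2.5, 1.7, 4, 3.5, 1.2, 2.3, 2.8)

# Values of x at which the cumulative distribution is evaluated
x <- 0:4

# Number and proportion of events with a nearest neighbor at distance <= x
count_le <- sapply(x, function(v) sum(nnd <= v))
proportion_le <- count_le / length(nnd)

# The empirical cumulative distribution function gives the same proportions
proportion_ecdf <- ecdf(nnd)(x)

data.frame(x, count_le, proportion_le, proportion_ecdf)
```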

We can plot these numbers of events as a proportion:

# Create a data frame for plotting the proportion of events with a nearest neighbor at a distance d_ij <= x
df <- data.frame(x = c(0, 1, 2, 3, 4), proportion = c(0, 2/10, 5/10, 8/10, 10/10))

# `geom_line()` creates lines that connect the coordinates of the data inputs
ggplot() + 
  geom_line(data = df, aes(x = x, y = proportion))

The cumulative distribution function of distances from event to nearest neighbor is called a \(G\)-function.

This function is defined as follows, with \(d_{ik}\) as the distance from the event at \(i\) to its nearest neighbor \(k\), and \(\#(\cdot)\) as the number of events that satisfy the condition: \[ \hat{G}(x)=\frac{\#(d_{ik}\le x, \forall i)}{n} \]

This function (with a hat, because it is estimated from the data) can be used to explore spatial point patterns. When doing so, it is useful to know that the theoretical value of \(G\) (assuming a null landscape generated by a Poisson process with intensity \(\lambda\)) is as follows: \[ G_{pois}(x) = 1 - \exp(-\lambda \pi x^2). \]

When the empirical \(\hat{G}(x)\) is greater than the theoretical function, this suggests that the events tend to be closer than expected, compared to the null landscape. This would be indicative of a pattern of events that form clusters. On the contrary, when the empirical function is less than the theoretical function, this would suggest that the events tend to be further away from each other than expected, compared to the null landscape. This would be indicative of a dispersed or regular pattern.
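A minimal sketch of this comparison in base R, without edge corrections, is shown next. The pattern is a hypothetical regular grid of 36 events on the unit square, and the helper functions `g_hat()` and `g_pois()` are names introduced here for illustration only.

```r
# Empirical G: proportion of events with a nearest neighbor at distance <= x
g_hat <- function(nn, x) sapply(x, function(v) mean(nn <= v))

# Theoretical G for a null landscape with intensity lambda
g_pois <- function(x, lambda) 1 - exp(-lambda * pi * x^2)

# Hypothetical regular pattern: 36 events on an evenly spaced 6-by-6 grid
ticks <- (seq_len(6) - 0.5) / 6
regular <- expand.grid(x = ticks, y = ticks)

# Distances to nearest neighbors
d <- as.matrix(dist(regular))
diag(d) <- Inf
nn <- apply(d, 1, min)

# Intensity: number of events divided by the area of the unit square
lambda <- nrow(regular) / 1

# Compare the empirical and theoretical functions at a few distances
x <- c(0.05, 0.10, 0.15, 0.20)
comparison <- data.frame(x,
                         empirical = g_hat(nn, x),
                         theoretical = g_pois(x, lambda))
comparison
```

At short distances the empirical function stays at zero while the theoretical function is already well above zero, which is the signature of a regular pattern; the empirical function then jumps to one once \(x\) exceeds the spacing of the grid.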

The \(G\)-function is implemented in spatstat as Gest (for \(G\) estimated):

# Use split to calculate the G-function only for "Pattern 1"
g_pattern1 <- Gest(split(pp0.ppp)$"Pattern 1", correction = "none")

(For the moment ignore the argument “correction”; we will discuss corrections later on.)

The plot() function can be used to visualize the estimated G (with r = x):

plot(g_pattern1)

In the plot above, the empirical function is the solid black line, and the theoretical is the dashed red line.

If you examine the empirical function, you will see that about 50% of events have a nearest neighbor at a distance of less than approximately 0.04. In the null landscape (theoretical function), in contrast, only about 16% of events have a nearest neighbor at less than 0.04:

plot(g_pattern1)
lines(x = c(0.04, 0.04), y = c(-0.1, 0.5), lty = "dotted")
lines(x = c(-0.1, 0.04), y = c(0.5, 0.5), lty = "dotted")
lines(x = c(-0.1, 0.04), y = c(0.16, 0.16), lty = "dotted", col = "red")
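The value of about 16% can be checked from the theoretical expression, assuming an intensity of 36 events per unit area (36 events in the unit square). The same expression can be inverted to find the distance at which half of the events would have a nearest neighbor in the null landscape:

```r
# Assumed intensity: 36 events in a window of area 1
lambda <- 36 / 1

# Theoretical proportion of events with a nearest neighbor at distance <= 0.04
g_null_004 <- 1 - exp(-lambda * pi * 0.04^2)
g_null_004

# Distance at which the theoretical function reaches 0.5
x_half <- sqrt(-log(1 - 0.5) / (lambda * pi))
x_half
```

Under these assumptions, half of the events in the null landscape would have a nearest neighbor within roughly 0.08, about twice the distance needed to reach the same proportion in “Pattern 1”.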

Notice that the empirical function is above the theoretical function. This suggests that in the actual landscape events tend to be much closer to other events than in the null landscape, which is suggestive of clustering.

Compare to “Pattern 2”:

g_pattern2 <- Gest(split(pp0.ppp)$"Pattern 2", correction = "none")
plot(g_pattern2)

Now the empirical function is below the one for the null landscape. Notice too that all events have a nearest neighbor in a limited range of distances, between 0.14 and 0.18. This is indicative of a dispersed, or regular pattern.

And the random pattern that you created before:

g_pattern_rnd <- Gest(rand_ppp, correction = "none")
plot(g_pattern_rnd)

In this case, the empirical function more closely resembles the theoretical function for the null landscape. This suggests a random pattern.

By considering the distribution of distances to nearest neighbors, you can generate additional information on a point pattern to complement the density-based analysis of the preceding chapters.

14 Activity 6: Point Pattern Analysis III

NOTE: The source files for this book are available with the companion package {isdas}. The source files are in R Markdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

14.1 Practice questions

Answer the following questions:

  1. List and explain two limitations of quadrat analysis.
  2. What is clustering? What could explain a clustering in a set of events?
  3. What is regularity? What could explain it?
  4. Describe the concept of nearest neighbors.
  5. What is a cumulative distribution function?

14.2 Learning objectives

In this activity, you will:

  1. Explore a dataset using distance-based approaches.
  2. Compare the characteristics of different types of patterns.
  3. Discuss ways to evaluate how confident you are that a pattern is random.

14.3 Suggested reading

O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 5. John Wiley & Sons: New Jersey.

14.4 Preliminaries

It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently on the workspace.

Load the libraries you will use in this activity. In addition to tidyverse, you will need spatstat, a package designed for the analysis of point patterns (you can learn about spatstat here and here):

library(isdas)
library(maptools) # Needed to convert `SpatialPolygons` into `owin` object
library(tidyverse)
library(sf)
library(spatstat)

In the practice that preceded this activity, you learned about clustered and dispersed patterns, about nearest neighbors, and about the \(G\)-function. For this practice, you will use the data that you first encountered in Activity 4, that is, the business locations in Toronto.

Begin by reading the geospatial files, namely the city boundary of Toronto. You need the sf object, which will be converted into a spatstat window object:

data("Toronto")

Convert the sf object to an owin object (via SpatialPolygons, hence as(x, "Spatial")):

Toronto.owin <- as.owin(as(Toronto, "Spatial")) # Requires `maptools` package

Next the data that you will use in this activity needs to be loaded. Each dataframe is converted into a ppp object using the as.ppp function, again after extracting the coordinates of the events from the sf object:

data("Fast_Food")
Fast_Food.ppp <- as.ppp(st_coordinates(Fast_Food), W = Toronto.owin)
# Add the classes of fast food to the ppp object:
marks(Fast_Food.ppp) <- Fast_Food$Class

data("Gas_Stands")
Gas_Stands.ppp <- as.ppp(st_coordinates(Gas_Stands), W = Toronto.owin)

data("Paez_Mart")
Paez_Mart.ppp <- as.ppp(st_coordinates(Paez_Mart), W = Toronto.owin)

If you inspect your workspace, you will see that the following ppp objects are there:

  • Fast_Food.ppp
  • Gas_Stands.ppp
  • Paez_Mart.ppp

These are locations of fast food restaurants and gas stands in Toronto (data are from 2008). Paez Mart on the other hand is a project to cover Toronto with convenience stores. The points are the planned locations of the stores.

You can check the contents of ppp objects by means of summary:

summary(Fast_Food.ppp)
## Marked planar point pattern:  614 points
## Average intensity 9.681378e-07 points per square unit
## 
## Coordinates are given to 1 decimal place
## i.e. rounded to the nearest multiple of 0.1 units
## 
## Multitype:
##           frequency proportion    intensity
## Chicken          82  0.1335505 1.292953e-07
## Hamburger       209  0.3403909 3.295453e-07
## Pizza           164  0.2671010 2.585906e-07
## Sub             159  0.2589577 2.507067e-07
## 
## Window: polygonal boundary
## 10 separate polygons (no holes)
##             vertices        area relative.area
## polygon 1       4185 630935000.0      9.95e-01
## polygon 2        600   2536260.0      4.00e-03
## polygon 3        193    237206.0      3.74e-04
## polygon 4         28     26539.7      4.18e-05
## polygon 5         52    142793.0      2.25e-04
## polygon 6         67    158439.0      2.50e-04
## polygon 7         41     83470.2      1.32e-04
## polygon 8         30     42934.1      6.77e-05
## polygon 9         36     33866.6      5.34e-05
## polygon 10         8     11069.2      1.75e-05
## enclosing rectangle: [609550.5, 651611.8] x [4826375, 4857439] units
##                      (42060 x 31060 units)
## Window area = 634207000 square units
## Fraction of frame area: 0.485
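The average intensity reported by summary() is the number of events divided by the area of the window. The following is a minimal check that uses only the values printed above; the variable names are ours and chosen for illustration, and the area is the rounded value shown in the output, so the result agrees only to several significant digits:

```r
# Values copied from the summary output above
n_events <- 614
window_area <- 634207000  # square units, as printed (rounded)

# Average intensity: events per square unit
intensity <- n_events / window_area
intensity

# Frequencies by class add up to the total number of events
class_freq <- c(Chicken = 82, Hamburger = 209, Pizza = 164, Sub = 159)
sum(class_freq)

# Proportions by class, as in the "proportion" column
class_freq / n_events
```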

Now that you have the data that you need in the right format, you are ready for the next activity.

14.5 Activity

NOTE: Activities include technical “how to” tasks/questions. Usually, these ask you to practice using the software to organize data, create plots, and so on in support of analysis and interpretation. The second type of questions ask you to activate your brainware and to think geographically and statistically.

Activity Part I

  1. Calculate the event-to-event distances to nearest neighbors using the function nndist(). Do this for all fast food establishments (pooled) and then for each type of establishment (i.e., “Chicken”, “Hamburger”, “Pizza”, “Sub”).

  2. Create Stienen diagrams using the distance vectors obtained in Step 1.

  3. Plot the empirical G-function for all fast food establishments (pooled) and then for each type of establishment (i.e., “Chicken”, “Hamburger”, “Pizza”, “Sub”).

Activity Part II

  1. Discuss the diagrams that you created in Question 2 with a fellow student.

  2. Is there evidence of clustering/regularity?

  3. How confident are you in deciding whether the patterns are not random? What could you do to assess your confidence in that decision? Explain.

15 Point Pattern Analysis IV

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

In the last practice/session your learning objectives included:

  1. Learning about clustered and dispersed (or regular) patterns.
  2. Learning the concept of nearest neighbors.
  3. Learning about distance-based methods for point pattern analysis.
  4. Learning about the \(G\)-function for the analysis of event-to-event nearest neighbor distances.

15.1 Learning Objectives

In this chapter, you will:

  1. Learn about the \(F\)- or empty space function.
  2. Consider the issue of patterns at multiple scales.
  3. Learn about the \(K\)-function.
  4. Apply both of these techniques using a simple example.

15.2 Suggested Readings

  • Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapter 3. Longman: Essex.
  • Baddeley A, Rubak E, Turner R (2016) Spatial Point Patterns: Methodology and Applications with R, Chapters 7 - 8. CRC: Boca Raton.
  • Bivand RS, Pebesma E, Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 7. Springer: New York.
  • Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 6, 6.1 - 6.6. Sage: Los Angeles.
  • O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 5. John Wiley & Sons: New Jersey.

15.3 Preliminaries

As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently on the workspace.

Load the libraries you will use in this activity:

library(isdas)
library(spatstat)
library(tidyverse)

Load the datasets that you will use for this practice:

data("pp1_df")
data("pp2_df")
data("pp3_df")
data("pp4_df")
data("pp5_df")

These five dataframes include the coordinates of events set in the space of a unit square. To convert these dataframes into ppp objects we first define a window:

# We use `owin()` to define a window (here, a unit square) that contains the coordinates in the five dataframes
W <- owin(c(0, 1), c(0, 1))

And then use the function as.ppp to convert into ppp:

# `as.ppp()` is a function that we use to convert dataframes into ppp objects
pp1.ppp <- as.ppp(pp1_df, W = W)
pp2.ppp <- as.ppp(pp2_df, W = W)
pp3.ppp <- as.ppp(pp3_df, W = W)
pp4.ppp <- as.ppp(pp4_df, W = W)
pp5.ppp <- as.ppp(pp5_df, W = W)

15.4 Motivation

Distance-based approaches like the \(\hat{G}\)-function provide a useful complement to density-based approaches. They can be implemented in more ways than we have seen so far.

In this practice, you will learn about two more tools for conducting distance-based analysis, the \(\hat{F}\)-function and the \(\hat{K}\)-function.

15.5 F-function

The \(\hat{G}\)-function was defined as the cumulative distribution of the distances from events to their nearest neighboring event. The \(\hat{F}\)-function is based on the same premise, but instead of using event-to-event distances, it uses point-to-event distances.

Recall that a point is an arbitrary location on a map that is not necessarily the location of an event. It may well be (and typically is) empty space. For this reason, the \(\hat{F}\)-function is sometimes called the empty space function: when there is more empty space in a region, the distance from a point to the nearest neighboring event is typically longer.

More formally, this function is defined as follows, with \(d_{ik}\) as the distance from the point at \(i\) (not necessarily an event!) to its nearest neighboring event at location \(k\), \(\#(\cdot)\) as the number of points for which the condition holds, and \(n\) as the number of points: \[ \hat{F}(x)=\frac{\#(d_{ik}\le x, \forall i)}{n} \]

Again, we use the hat notation to indicate that the function is estimated from the data.

The theoretical distribution of this function is known (based on a null landscape generated by a spatially random Poisson process: remember that a Poisson process is a type of random process that consists of points randomly located on a landscape). It is as follows: \[ F_{pois}(x) = 1 - \exp(-\lambda \pi x^2). \]

Notice that the distribution is in fact identical to that for \(G\). This makes sense: if the distribution of events is spatially random, the distribution of empty space in the region must be random as well!
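The theoretical function is easy to evaluate directly. The following is a minimal sketch in base R; the function name `F_pois()` and the intensity of 81 events per unit area are assumptions made for illustration:

```r
# Theoretical F (and G) function under a spatially random Poisson process.
# `x` is the distance and `lambda` is the intensity (events per unit area).
F_pois <- function(x, lambda) {
  1 - exp(-lambda * pi * x^2)
}

# Assumed intensity, for illustration only
x <- seq(0, 0.2, by = 0.01)
F_theo <- F_pois(x, lambda = 81)

# The function starts at zero and approaches one as the distance grows
plot(x, F_theo, type = "l", xlab = "x", ylab = "F_pois(x)")
```

A higher intensity makes the curve rise faster, because with more events per unit area a nearest event is found at shorter distances.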

The interpretation of \(\hat{F}(x)\) is the opposite of \(\hat{G}(x)\): when the empirical \(\hat{F}(x)\) is greater than the theoretical function, this suggests that empty spaces are closer to events than expected, compared to the null landscape, as in a dispersed pattern. On the contrary, when the empirical function is less than the theoretical function, this would suggest a clustered pattern, since the events tend to be far away from the points used to calculate the function.

The \(\hat{F}\)-function can be implemented in at least two ways: (1) by using a fine grid to measure the distance to events; or (2) by measuring the distance to events from randomly drawn coordinates. The implementation in spatstat is the first one, which results in a pixel-based image of empty space.
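To make the two approaches concrete, the sketch below computes an empirical function both ways using only base R. It is not the implementation used by spatstat; the simulated events, the helper `nearest_event_dist()`, and the grid resolution are all assumptions made for illustration, and no boundary correction is applied:

```r
set.seed(123)

# Hypothetical events: 50 locations drawn at random in the unit square
events <- cbind(x = runif(50), y = runif(50))

# Distance from each point (a row of `pts`) to its nearest event
nearest_event_dist <- function(pts, events) {
  apply(pts, 1, function(p) {
    min(sqrt((events[, 1] - p[1])^2 + (events[, 2] - p[2])^2))
  })
}

# (1) Points on a fine regular grid (100 x 100 cell centres)
g <- seq(0.005, 0.995, by = 0.01)
grid_pts <- as.matrix(expand.grid(x = g, y = g))
d_grid <- nearest_event_dist(grid_pts, events)

# (2) Points at randomly drawn coordinates
rand_pts <- cbind(x = runif(10000), y = runif(10000))
d_rand <- nearest_event_dist(rand_pts, events)

# Empirical F: proportion of points whose nearest event is within distance x
x <- seq(0, 0.3, by = 0.005)
F_grid <- sapply(x, function(xi) mean(d_grid <= xi))
F_rand <- sapply(x, function(xi) mean(d_rand <= xi))

plot(x, F_grid, type = "l", xlab = "x", ylab = "F(x)")
lines(x, F_rand, lty = "dashed", col = "red")
```

With this many points, the two curves should be nearly indistinguishable; the grid has the advantage that the same distances can also be displayed as an image.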

We can illustrate this function with the point pattern pp1.ppp. First, we verify that pp1.ppp is already a ppp object:

class(pp1.ppp)
## [1] "ppp"

Begin by plotting the pattern:

plot(pp1.ppp)

An empty space map is obtained by means of the distmap() function:

# The "distmap()" function computes the distance map of point pattern X and returns the distance map as a pixel image

empty_space_map1 <- distmap(pp1.ppp) 

The plot of this is:

plot(empty_space_map1)

Similar to the Stienen diagrams that you used previously, this map shows the distance from any location on the map to the nearest event: the smaller the value, the closer the point is to an event. It is evident in this pixel image that the values are mostly small, which means that most points are close to an event.

Compare the map above to pp2.ppp:

empty_space_map2 <- distmap(pp2.ppp)
plot(empty_space_map2)

In the second point pattern, there is more open space in the region. This is also apparent from the symbols map:

plot(pp2.ppp)

The \(\hat{F}\)-function is implemented in spatstat as Fest() (for F-estimated), and it requires a ppp object as an input. Another possible input is whether a correction is to be used. This refers to boundary corrections. Since we have not yet discussed them, select “none”:

# The `Fest()` function computes an estimate of the empty space function, which is also called the "point-to-nearest-event" distribution: the cumulative distribution of the distances from points to their nearest event (in this example, for pp1)
f_pattern1 <- Fest(pp1.ppp, correction = "none")

This function can be plotted as follows:

plot(f_pattern1)

The black line is the empirical function, and we see that it is in general very similar to the theoretical function that corresponds to a null landscape. Compare to the second pattern:

f_pattern2 <- Fest(pp2.ppp, correction = "none")
plot(f_pattern2)
lines(x = c(0, 0.097), y = c(0.4, 0.4), col = "blue", lty = "dotted")
lines(x = c(0.045, 0.045), y = c(0.0, 0.4), col = "blue", lty = "dotted")
lines(x = c(0.097, 0.097), y = c(0.0, 0.4), col = "blue", lty = "dotted")

In the empirical (black) function, points on a grid tend to be more distant from events than what you would expect from the null landscape. For example, whereas under the theoretical function 40% of points have a nearest event that is at a distance of approximately 0.045 or less, under the empirical function the events are generally more distant from the points, and for the same value of F (0.4 or 40%) the distance is closer to 0.1. See:

# Repeat the plot of the F-function of `pp2.ppp` and use the function `lines()` to add lines to compare the distances for a given value of F, say 0.4 (or 40%)
plot(f_pattern2)
lines(x = c(0, 0.097), y = c(0.4, 0.4), col = "blue", lty = "dotted")
lines(x = c(0.045, 0.045), y = c(0.0, 0.4), col = "blue", lty = "dotted")
lines(x = c(0.097, 0.097), y = c(0.0, 0.4), col = "blue", lty = "dotted")

This suggests that the events are clustered. Try plotting the \(\hat{G}\)-functions for the patterns in this example, and compare.
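The distance of approximately 0.045 quoted for the theoretical function can be checked by solving \(0.4 = 1 - \exp(-\lambda \pi x^2)\) for \(x\). The sketch below assumes an intensity of 81 events per unit area (the value that summary() reports for these patterns); the function name `F_pois_inverse()` is ours:

```r
# Distance at which the theoretical function reaches a given value `p`,
# obtained by solving p = 1 - exp(-lambda * pi * x^2) for x
F_pois_inverse <- function(p, lambda) {
  sqrt(-log(1 - p) / (lambda * pi))
}

# Assumption: intensity of 81 events per unit area
x_40 <- F_pois_inverse(p = 0.4, lambda = 81)
x_40
```

The result is close to 0.045, which is less than half the distance of roughly 0.1 read from the empirical function.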

15.6 \(\hat{K}\)-function

A limitation of the two techniques that we have seen so far is that they deal with a single scale: the distance to the first nearest neighbor (or, more generally, to the \(k\)-th nearest neighbor; these functions can be used for the 2nd, 3rd, and so on nearest neighbor!). Their single scale nature means that these functions can easily miss patterns when they are only evident at different scales.
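As a minimal illustration of distances to the \(k\)-th nearest neighbor, the base R sketch below computes them from the matrix of all event-to-event distances. The simulated events and the helper names are assumptions made for illustration; in spatstat, nndist() accepts an argument k for the same purpose:

```r
set.seed(42)

# Hypothetical events in the unit square
events <- cbind(x = runif(30), y = runif(30))

# Matrix of all event-to-event distances
d <- as.matrix(dist(events))

# Distance from each event to its k-th nearest neighbor.
# After sorting a row, position 1 is the event itself (distance 0),
# so the k-th nearest neighbor is at position k + 1.
kth_nn_dist <- function(d, k = 1) {
  apply(d, 1, function(row) sort(row)[k + 1])
}

nn1 <- kth_nn_dist(d, k = 1)
nn2 <- kth_nn_dist(d, k = 2)

# Empirical cumulative distribution of the k-th nearest neighbor distances
G_k <- function(nn, x) sapply(x, function(xi) mean(nn <= xi))
x <- seq(0, 0.4, by = 0.01)
plot(x, G_k(nn1, x), type = "l", xlab = "x", ylab = "G(x)")
lines(x, G_k(nn2, x), lty = "dashed")
```

Each choice of \(k\) still yields a single curve for a single neighbor order, which is why these functions remain single-scale tools.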

Consider for instance the following point pattern:

plot(pp3.ppp)

The events above initially appear to be clustered. However, at a different scale, a second pattern becomes evident. In fact, what we observe is a regular distribution of clusters. At a smaller scale, a single cluster may actually be a random distribution of events. In contrast, the following pattern appears to be a random distribution of regularly spaced events:

plot(pp4.ppp)

Whereas the last point pattern is of clusters of dispersed events that are themselves regularly spaced:

plot(pp5.ppp)

Both \(\hat{G}(x)\) and \(\hat{F}(x)\), when applied to any of these patterns, will strongly hint at clustering at the scale of the first nearest neighbor. Regrettably, they fail to detect patterns that might exist at other scales. For instance:

f_pattern3 <- Fest(pp3.ppp, correction = "none")
plot(f_pattern3)

g_pattern3 <- Gest(pp3.ppp, correction = "none")
plot(g_pattern3)

A different technique, called the \(\hat{K}\)-function, is designed to detect patterns at multiple scales (see Ripley 1976; and Haase 1995). The intuition behind the function is as follows.

Imagine that you visit every one of the events in the point pattern in sequence. Each time you visit an event you do the following: first, you create a circle with radius “x” centered on the event, and then you count the number of events that are within the circle. Then you increase “x” by some distance, and repeat the process. Once you have created the last circle (which will be suitably large to capture patterns at that scale), you move and visit the next event in the pattern and repeat the exact same process. These counts of events at distances “x” are aggregated and normalized by the estimated intensity of the point pattern.

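More formally, this is (with \(A\) as the area of the region, \(n\) as the number of events, \(\hat{\lambda}=n/A\) as the estimated intensity, and \((d_{ij}\le x)\) equal to one when the condition holds and zero otherwise): \[ \hat{K}(x)=\frac{1}{\hat{\lambda}^2A}\sum_{i}\sum_{j\neq i}(d_{ij}\le x). \] Since \(\hat{\lambda}^2A=\hat{\lambda}n\), this is the average number of other events within distance \(x\) of an event, divided by the intensity, which makes it comparable to an area.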

As before, the theoretical values for this function are known for the case of a null landscape generated by a Poisson process: \[ K_{pois}(x)=\pi x^2. \] When the empirical function is greater than the theoretical function, this would suggest that events are typically surrounded by more events at that distance than what the null landscape would have. This is interpreted as evidence of clustering.

In contrast, when the empirical function is less than the theoretical one, this would suggest that events are typically surrounded by fewer events at that distance than what would be expected from a null landscape. This is interpreted as dispersion.
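The counting procedure described above can be sketched in a few lines of base R. This is a naive estimate without boundary correction, not the implementation in spatstat; the simulated events and the names `K_hat()` and `lambda_hat` are assumptions made for illustration. The number of ordered pairs of events within distance x is divided by \(\hat{\lambda}n\) (equivalently, \(\hat{\lambda}^2A\)) so that the result is on the same scale as \(\pi x^2\):

```r
set.seed(1)

# Hypothetical pattern: n events at random in the unit square
n <- 200
events <- cbind(x = runif(n), y = runif(n))
A <- 1                 # area of the region
lambda_hat <- n / A    # estimated intensity

# All event-to-event distances, excluding each event's distance to itself
d <- as.matrix(dist(events))
diag(d) <- NA

# Naive estimate: count ordered pairs within distance x, then normalize
K_hat <- function(x) {
  sapply(x, function(xi) sum(d <= xi, na.rm = TRUE) / (lambda_hat^2 * A))
}

x <- seq(0, 0.25, by = 0.01)
plot(x, K_hat(x), type = "l", xlab = "x", ylab = "K(x)")
lines(x, pi * x^2, lty = "dashed", col = "red")  # theoretical, null landscape
```

Because the events here were drawn at random, the empirical curve should stay close to the dashed line at short distances; at longer distances the lack of a boundary correction pulls it below the theoretical curve.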

The \(\hat{K}\)-function is implemented in the package spatstat as Kest().

To see how this function works, plot pp3.ppp once more:

plot(pp3.ppp)

Next, use Kest() to calculate and plot the \(\hat{K}\)-function:

# The `Kest()` function estimates the K-function, which counts the events found within increasing distances of each event, so it captures more than just the distance to the first nearest neighbor. Here, we are applying the K-function to `pp3.ppp`. As before, ignore the correction; we will discuss this later
k_pattern3 <- Kest(pp3.ppp, correction = "none")
plot(k_pattern3)

As seen from the plot, the function is suggestive of clustering at smaller scales, but regularity at a larger scale.

Try this now with the last pattern:

plot(pp5.ppp)

If you calculate and plot the \(\hat{K}\)-function:

k_pattern5 <- Kest(pp5.ppp, correction = "none")
plot(k_pattern5)

You will see that the plot correctly suggests dispersion at the very small scale, followed by clustering at an intermediate scale. There are indeed clusters of nine events surrounded by empty space, before other clusters of regular events are detected at the largest scale, following a regular pattern.

Of the distance-based techniques that you have seen so far, \(\hat{G}(x)\) and \(\hat{F}(x)\) are often used as complements. The \(\hat{K}(x)\) is useful when exploring multi-scale patterns.

This concludes the chapter, and our coverage of distance-based techniques.

16 Activity 7: Point Pattern Analysis IV

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

16.1 Practice questions

Answer the following questions:

  1. What does the \(\hat{G}\)-function measure?
  2. What does the \(\hat{F}\)-function measure?
  3. How do these two functions relate to one another?
  4. Describe the intuition behind the \(\hat{K}\)-function.
  5. How does the \(\hat{K}\)-function capture patterns at multiple scales?

16.2 Learning objectives

In this activity, you will:

  1. Explore a dataset using single scale distance-based techniques.
  2. Explore the characteristics of a point pattern at multiple scales.
  3. Discuss ways to evaluate how confident you are that a pattern is random.

16.3 Suggested reading

O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 5. John Wiley & Sons: New Jersey.

16.4 Preliminaries

It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently on the workspace.

Load the libraries you will use in this activity. In addition to tidyverse, you will need spatstat, a package designed for the analysis of point patterns (you can learn about spatstat here and here):

library(isdas)
library(maptools) # Needed to convert `SpatialPolygons` into `owin`-class object
library(sf)
library(spatstat)
library(tidyverse)

For this activity, you will use the same datasets that you used in Activity 6, including the geospatial files for Toronto’s city boundary:

data("Toronto")

Convert the sf object to an owin object (via SpatialPolygons, hence as(x, "Spatial")):

Toronto.owin <- as.owin(as(Toronto, "Spatial")) # Requires `maptools` package

Next, load the data that you will use in this activity. Each dataframe is converted into a ppp object using the as.ppp function, again after extracting the coordinates of the events from the sf object:

data("Fast_Food")
Fast_Food.ppp <- as.ppp(st_coordinates(Fast_Food), W = Toronto.owin)
# Add the classes of fast food to the ppp object:
marks(Fast_Food.ppp) <- Fast_Food$Class

data("Gas_Stands")
Gas_Stands.ppp <- as.ppp(st_coordinates(Gas_Stands), W = Toronto.owin)

data("Paez_Mart")
Paez_Mart.ppp <- as.ppp(st_coordinates(Paez_Mart), W = Toronto.owin)

Now that you have the data sets in the appropriate format, you are ready for the next activity.

16.5 Activity

NOTE: Activities include technical “how to” tasks/questions. Usually, these ask you to practice using the software to organize data, create plots, and so on in support of analysis and interpretation. The second type of questions ask you to activate your brainware and to think geographically and statistically.

Activity Part I

  1. Plot the empirical \(\hat{F}\)-function for all fast food establishments (pooled) and then for each type of establishment separately (i.e., “Chicken”, “Hamburger”, “Pizza”, “Sub”).

  2. Plot the empirical \(\hat{K}\)-function for all fast food establishments (pooled) and then for each type of establishment (i.e., “Chicken”, “Hamburger”, “Pizza”, “Sub”).

Activity Part II

  1. Discuss your results with a fellow student. Is there evidence of clustering/regularity?

  2. What can you say about patterns at multiple-scales based on the graphs above?

  3. How confident are you in deciding whether the patterns are not random? What could you do to assess your confidence in that decision? Explain.

17 Point Pattern Analysis V

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

In the last practice/session your learning objectives included:

  1. Learning about the \(\hat{F}\)- or empty space function.
  2. Considering the issue of patterns at multiple scales.
  3. Learning about the \(\hat{K}\)-function.
  4. Applying these techniques using a simple example.

Please review the previous practices if you need a refresher on these concepts.

If you wish to work interactively with this chapter you will need the following:

  • An R markdown notebook version of this document (the source file).

  • A package called isdas.

17.1 Learning Objectives

In this chapter, you will:

  1. Revisit the concept of hypothesis testing
  2. Revisit the concept of null landscapes.
  3. Learn about the use of simulation for hypothesis testing.
  4. Learn to implement simulation envelopes
  5. Consider some caveats when working with point patterns

17.2 Suggested Readings

  • Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapter 3. Longman: Essex.
  • Baddeley A, Rubak E, Turner R (2016) Spatial Point Patterns: Methodology and Applications with R, Chapter 10. CRC: Boca Raton.
  • Bivand RS, Pebesma E, Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 7. Springer: New York.
  • Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 6, 6.1 - 6.6. Sage: Los Angeles.
  • O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 5. John Wiley & Sons: New Jersey.

17.3 Preliminaries

As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently on the workspace.

Load the libraries you will use in this activity:

library(isdas)
library(spatstat)
library(tidyverse)

Load the datasets that you will use for this practice:

data("pp1_df")
data("pp2_df")
data("pp3_df")
data("pp4_df")
data("pp5_df")

These five dataframes include the coordinates of events set in the space of a unit square. To convert these dataframes into ppp objects we first define a window:

W <- owin(c(0, 1), c(0, 1))

And then use the function as.ppp to convert into ppp:

pp1.ppp <- as.ppp(pp1_df, W = W)
pp2.ppp <- as.ppp(pp2_df, W = W)
pp3.ppp <- as.ppp(pp3_df, W = W)
pp4.ppp <- as.ppp(pp4_df, W = W)
pp5.ppp <- as.ppp(pp5_df, W = W)

17.4 Motivation: Hypothesis Testing

In the previous sessions you learned about density- and distance-based techniques for the analysis of spatial point patterns.

With the exception of the test of independence for quadrats, other techniques (including kernel density, the \(\hat{G}\)- and \(\hat{F}\)-functions, and the \(\hat{K}\)-function), did not have a formal hypothesis testing framework.

The question of “how confident are you when deciding whether a pattern is random” forms the basis of hypothesis testing. In other words, when deciding whether to reject a null hypothesis, we would like to know the probability that we are making a mistake with the decision. Quantifying our uncertainty is a key feature of statistical analysis.

In statistics, tests of hypothesis are developed following these general steps:

  1. Identify a null hypothesis of interest, and if possible alternative hypotheses as well (although the latter is not always possible).

For instance, in point pattern analysis, a null hypothesis of interest is whether a pattern is random. If it is not, we would like to know in which way it is not random (i.e., is it clustered? Or on the contrary, is it regular?)

  2. Derive the expected value of the summary statistic of interest.

In the case of the \(\hat{G}\)-function, for instance, the expected value of the function under the null hypothesis of a spatially random Poisson process is: \[ G_{pois}(x) = 1 - \exp(-\lambda \pi x^2). \]

Similar expressions were presented for the \(\hat{F}\)-function and \(\hat{K}\)-function, but not for kernel density estimates. When the expected value of the function is known, the closer the empirical function is to its expected value, the more likely it is that the null hypothesis is true.

For instance, the \(\hat{G}\)-function of the pattern in pp1.ppp is shown below. It is quite close to the theoretical function, so the pattern is probably random. The question is, how probable is this?

g_pp1 <- Gest(pp1.ppp, correction = "none") 
plot(g_pp1)

  3. To make a decision whether to reject the null hypothesis (or contrariwise, fail to reject it), we need to know how close is close enough to the expected value. This step depends on how much variability there is in the random process around its expected value. In other words, we need to know the variance of the summary statistic under the null hypothesis.

Unfortunately, the variance of the theoretical random processes is not known in the case of many spatial point pattern techniques (the quadrat-based test of independence is an exception.) For a long time, this meant that the techniques remained purely descriptive, and it was not possible to quantify uncertainty when trying to decide whether a pattern was random: the decision would remain purely subjective.

Fortunately, with the growth in use of computers in statistical analysis, the lack of theoretical expressions for the variance can be circumvented by means of simulation. Simulation has many applications in statistics, and is certainly relevant in the analysis of point patterns, allowing us to generate null landscapes with ease.

17.5 Null Landscapes Revisited

A null landscape is a landscape produced by a random process. In previous practices you saw various ways of generating null landscapes. A useful way of generating null landscapes for point patterns is by means of a Poisson process. The package spatstat implements this by means of the function rpoispp(). This function generates a null landscape given an intensity parameter and a window.
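To see what such a process involves, the following base R sketch simulates a Poisson process on a rectangular window in two steps. It is a conceptual illustration under stated assumptions, not the code used by spatstat; the function name `simulate_poisson_pattern()` and its arguments are ours:

```r
set.seed(2021)

# Sketch of a homogeneous Poisson process on a rectangular window.
# `lambda` is the intensity (expected number of events per unit area).
simulate_poisson_pattern <- function(lambda,
                                     xrange = c(0, 1),
                                     yrange = c(0, 1)) {
  area <- diff(xrange) * diff(yrange)
  # Step 1: the number of events is a Poisson random variable
  n <- rpois(1, lambda * area)
  # Step 2: given n, coordinates are independent and uniform in the window
  data.frame(x = runif(n, xrange[1], xrange[2]),
             y = runif(n, yrange[1], yrange[2]))
}

null_landscape <- simulate_poisson_pattern(lambda = 81)
nrow(null_landscape)  # varies from one simulation to the next
plot(null_landscape, asp = 1)
```

Note that the number of events is itself random: the intensity fixes the expected number of events, so a simulated landscape will usually have close to, but not exactly, 81 events.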

Before creating a null landscape, we can check the characteristics of the patterns in the dataset:

summary(pp1.ppp)
## Planar point pattern:  81 points
## Average intensity 81 points per square unit
## 
## Coordinates are given to 8 decimal places
## 
## Window: rectangle = [0, 1] x [0, 1] units
## Window area = 1 square unit

You can verify that the intensity in every case is 81 points per square unit, and the window is a square unit.

Let's copy the window from one of the patterns in the sample dataset:

# We can use `$` to index an item in the object `pp1.ppp`
W <- pp1.ppp$window

It is possible to generate a null landscape as follows, by means of the function rpoispp(). The arguments of this function are a desired intensity (\(\lambda\)) and a window:

# The function `rpoispp()` is used to generate null landscapes based on the Poisson distribution
sim1 <- rpoispp(lambda = 81, win = W)

The value (i.e., output) of this function is a ppp object that can be analyzed in all the ways that you already know. For instance, you can plot it:

plot(sim1)

Importantly, you can apply any of the techniques that you have seen so far, for instance, the \(\hat{G}\)-function:

g_sim1 <- Gest(sim1, 
               correction = "none")

We can try plotting the empirical functions (notice that the result of Gest is a dataframe with the values of r, the distance variable, the raw or empirical function, and the theoretical function). To plot using ggplot2 you can stack the two dataframes as follows (after adding a factor to indicate if it is the empirical function or a simulation):

# Use `data.frame()` to create a table with the relevant elements of the `g_pp1` object; in this example we take `raw` and put it in a column called `G`, we take `r` and put it in a column called `x`, and create a new variable called `Type` to indicate that these values are for "Pattern 1" (the empirical function). Then we use `rbind()` to bind the rows of this data frame and a second data frame that has the same columns, but is based on the simulated null landscape
g_all <- data.frame(G = g_pp1$raw, 
                    x = g_pp1$r, 
                    Type = "Pattern 1")
g_all <- rbind(g_all, 
               data.frame(G = g_sim1$raw, 
                          x = g_sim1$r, 
                          Type = "Simulation"))

We can use ggplot2 to create a plot of the two functions:

# By assigning `Type` to the aesthetic of `color` in `ggplot()`, we plot lines of different types in different colors
ggplot(data = g_all,
       aes(x= x, 
           y = G, 
           color = Type)) + 
  geom_line()

After seeing the plot above, we notice that the empirical function is very, very similar to that of the simulated null landscape. But is this purely a coincidence? After all, when we simulate a null landscape, there is the possibility, however improbable, that it will replicate some meaningful process purely by chance. To be sure, we can simulate and analyze a second null landscape:

sim2 <- rpoispp(lambda = 81, win = W)
g_sim2 <- Gest(sim2, 
               correction = "none")
g_all <- rbind(g_all,
               data.frame(G = g_sim2$raw, 
                          x = g_sim2$r, 
                          Type = "Simulation"))

Plot again:

ggplot(data = g_all, 
       aes(x= x, 
           y = G, 
           color = Type)) + 
  geom_line()

The empirical function continues to look very similar to the simulated null landscapes. We could simulate more null landscapes and increase our confidence that the empirical function indeed is similar to a null landscape (notice the use of a for loop to repeat the same instructions multiple times):

# Flow control functions include `for()`; this function will repeat the statements that follow a set number of times. In this example, we had already simulated 2 null landscapes above, so we want to simulate null landscapes 3 through 99
for(i in 3:99){
  g_sim <- Gest(rpoispp(lambda = 81, 
                        win = W), 
                correction = "none")
  g_all <- rbind(g_all, 
                 data.frame(G = g_sim$raw, 
                            x = g_sim$r, 
                            Type = "Simulation"))
}

With this we have generated 99 distinct null landscapes. Try plotting the empirical function with the functions of all of these simulated landscapes:

ggplot(data = g_all, 
       aes(x= x,
           y = G,
           color = Type)) + 
  geom_line()

You can see in the plot above that the empirical function is actually not visible! It is obscured by the null landscapes, since it falls somewhere within the limits of the functions for all the simulated patterns. The interpretation of this is as follows: out of 100 patterns (the empirical pattern and 99 null landscapes), the empirical pattern is not noticeably different from the random ones. How confident would you be rejecting the null hypothesis, i.e., deciding that the empirical pattern is not random?

We can follow the same process to compare the second pattern, pp2.ppp, to the simulated null landscapes:

# Compute the G-function for the point pattern in `pp2.ppp`, then extract the value of G and the distance, and label them as "Pattern 2" in a new data frame (by means of `data.frame()`)
g_pp2 <- Gest(pp2.ppp, 
              correction = "none")
g_pp2 <- data.frame(G = g_pp2$raw, 
                    x = g_pp2$r, 
                    Type = "Pattern 2")

# Bind the results of the G-function for `pp2.ppp` to the data frame with the simulations, and use `mutate()` to convert `Type` into a factor
g_all <- rbind(g_all, 
               g_pp2)
g_all <- mutate(g_all, 
                Type = factor(Type,
                              levels = c("Pattern 1", 
                                         "Pattern 2", 
                                         "Simulation")))

# Use filter to remove all observations associated with "Pattern 1"; in this case, Type not equal (i.e., `!=`) to "Pattern 1". This way we can plot only the G-function of "Pattern 2" and the simulations
ggplot(data = filter(g_all, 
                     Type != "Pattern 1"), 
       aes(x= x, 
           y = G,
           color = Type)) + 
  geom_line()

We can see that the empirical \(\hat{G}\)-function of pp2.ppp is quite distinct from the 99 null landscapes that we generated! How confident would you be rejecting the null hypothesis now?

17.6 Simulation Envelopes

Simulation, as seen above, can be quite powerful for hypothesis testing in situations where the theoretical parameters, for example the variance of a function, are not known. Essentially, the area covered by the \(\hat{G}\)-functions of the simulated landscapes above is an estimate of the variability of the function. The set of functions estimated on the null landscapes is used to obtain what we call simulation envelopes.
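To make the idea of an envelope concrete, here is a minimal sketch in base R. It is not the implementation used by spatstat. It assumes 81 events in a unit-square window, uses a randomly generated pattern as a stand-in for the empirical pattern, and the helper `g_hat()` is a hypothetical function that computes the uncorrected \(\hat{G}\)-function as the proportion of events whose nearest neighbor is within distance \(r\).

```r
set.seed(42)

# Hypothetical helper: uncorrected G-function evaluated at distances r
g_hat <- function(x, y, r) {
  d <- as.matrix(dist(cbind(x, y)))
  diag(d) <- Inf               # Ignore the distance from an event to itself
  nnd <- apply(d, 1, min)      # Nearest neighbor distance of each event
  ecdf(nnd)(r)                 # Proportion of events with nnd <= r
}

r <- seq(0, 0.15, by = 0.005)
n <- 81

# Stand-in for the empirical pattern (an assumption for this sketch)
g_obs <- g_hat(runif(n), runif(n), r)

# 99 null landscapes; each column holds the G-function of one simulation
g_sim <- replicate(99, g_hat(runif(n), runif(n), r))

# Pointwise envelopes: smallest and largest simulated value at each distance
lo <- apply(g_sim, 1, min)
hi <- apply(g_sim, 1, max)

# Proportion of distances at which the stand-in pattern is inside the envelopes
mean(g_obs >= lo & g_obs <= hi)
```

The envelopes here are simply the pointwise minimum and maximum over the simulated functions, which is the area that the simulated lines cover in the plots above.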

Since we lack a theoretical expression for the variance, we cannot obtain \(p\)-values to inform our decision to reject the null hypothesis. The simulation, however, provides a pseudo-\(p\)-value. If you generate 99 null landscapes, and the empirical pattern is still different, the probability that you are mistaken by rejecting the null hypothesis is at most 1% (since the next simulated landscape could expand the envelopes in such a way that it completely contains the empirical function).
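One common way to write the pseudo-\(p\)-value is sketched next. The notation is an assumption introduced only for this illustration: each function is reduced to a scalar summary \(T\) (for instance, its largest deviation from the theoretical function), with \(T_{obs}\) for the empirical pattern and \(T_1, \dots, T_{n_{sim}}\) for the null landscapes: \[ \hat{p} = \frac{1 + \#\{s : T_s \geq T_{obs}\}}{n_{sim} + 1} \]

With \(n_{sim} = 99\) and no simulated value as extreme as the empirical one, \(\hat{p} = 1/100 = 0.01\), which corresponds to the 1% mentioned above.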

As you saw above, using simulation for hypothesis testing is, in general terms, a relatively straightforward process (assuming that the null process is properly defined, etc.). The package spatstat includes a function, called envelope(), that can be used to generate simulation envelopes for several statistics used in point pattern analysis. For instance, for the \(\hat{G}\)-function, with 99 simulated landscapes:

# The function `envelope()` automates what we did above, simulating null landscapes; it takes as arguments a `ppp` object for the empirical pattern, a function that we desire to test, for example the function `Gest`, as well as the number of simulations that we wish to conduct. An additional argument `funargs = ` is used to pass other arguments to the function that is evaluated, i.e., in this example `Gest`
env_pp1 <- envelope(pp1.ppp,
                    Gest, 
                    nsim = 99, 
                    funargs = list(correction = "none"))
## Generating 99 simulations of CSR  ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62,
## 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98,  99.
## 
## Done.

The envelopes can be plotted:

plot(env_pp1)

It is easy to see that in this case the empirical function falls within the simulation envelopes, and thus we cannot reject the null hypothesis that the pattern is random.
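The visual check can be complemented with a numerical one. The sketch below is self-contained and uses a simulated pattern (`pp_demo`, a hypothetical object) instead of the patterns in the dataset. It relies on the fact that the object returned by `envelope()` can be converted to a data frame with columns `r`, `obs`, `lo`, and `hi`.

```r
library(spatstat)

set.seed(99)

# A simulated pattern in the unit square stands in for the empirical pattern
pp_demo <- rpoispp(lambda = 81)

env_demo <- envelope(pp_demo,
                     Gest,
                     nsim = 99,
                     funargs = list(correction = "none"),
                     verbose = FALSE) # Do not print the progress report

# Convert to a data frame and flag distances where the empirical function leaves the envelopes
env_df <- as.data.frame(env_demo)
outside <- env_df$obs < env_df$lo | env_df$obs > env_df$hi

# Proportion of distances at which the empirical function is outside the envelopes
mean(outside)
```

A proportion close to zero is consistent with what the plot of the envelopes shows for a random pattern.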

Also, the \(\hat{F}\)-function:

env_pp2 <- envelope(pp2.ppp, 
                    Fest, 
                    nsim = 99, 
                    funargs = list(correction = "none"))
## Generating 99 simulations of CSR  ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62,
## 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98,  99.
## 
## Done.
plot(env_pp2)

Now the empirical function lies well outside the simulation envelopes, which makes it very unlikely that the pattern was generated by the random process behind the null landscapes.

And finally, the \(\hat{K}\)-function:

env_pp3 <- envelope(pp3.ppp, 
                    Kest, 
                    nsim = 99, 
                    funargs = list(correction = "none"))
## Generating 99 simulations of CSR  ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62,
## 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98,  99.
## 
## Done.
plot(env_pp3)

Again, the empirical function lies mostly outside of the simulation envelopes, meaning that it is very improbable that it represents a random process. Simulation envelopes are a powerful way to test the hypothesis of null landscapes in the case of spatial point patterns.

17.7 Things to Keep in Mind!

Before concluding the topic of point pattern analysis, here are a few important caveats to keep in mind.

17.7.1 Definition of a Region

When defining the region (or window) for the analysis, care must be taken that it is reasonable from the perspective of the process under analysis. Defining the region in an inappropriate way can easily lead to misleading results.

Consider for instance the first pattern in the dataset. This pattern was defined for a unit-square window. We can apply the \(\hat{K}\)-function to it:

k_env_pp1 <- envelope(pp1.ppp, 
                      Kest, 
                      nsim = 99,
                      funargs = list(correction = "none"))
## Generating 99 simulations of CSR  ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62,
## 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98,  99.
## 
## Done.
plot(k_env_pp1)

Based on this we would most likely conclude that the pattern is random.

But if we replace the unit-square window by a much larger window, as follows:

W2 <- owin(x = c(-2,4), 
           y = c(-2, 4))
pp1_reg2 <- as.ppp(as.data.frame(pp1.ppp), 
                   W = W2)
plot(pp1_reg2)

In the context of the larger window, the point pattern now looks clustered! See how the definition of the window would change your conclusions regarding the pattern:

k_env_pp1_reg2 <- envelope(pp1_reg2, 
                           Kest, 
                           nsim = 99, 
                           funargs = list(correction = "none"))
## Generating 99 simulations of CSR  ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62,
## 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98,  99.
## 
## Done.
plot(k_env_pp1_reg2)

Care must be taken when defining the window/region for analysis to avoid spurious results.
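A back-of-the-envelope calculation helps to see why the larger window makes the pattern look clustered. The sketch below assumes 81 events (the intensity used earlier to simulate null landscapes; this number is not taken from pp1.ppp) and uses the standard result that, under complete spatial randomness and ignoring edge effects, the expected nearest neighbor distance is \(1/(2\sqrt{\lambda})\).

```r
# Assumed number of events (an assumption, not read from pp1.ppp)
n_events <- 81

# Areas of the unit-square window and of the larger window W2
area_unit <- 1 * 1
area_large <- (4 - (-2)) * (4 - (-2))

# The intensity is the number of events per unit area
lambda_unit <- n_events / area_unit
lambda_large <- n_events / area_large

# Expected nearest neighbor distance under a random process
e_nnd_unit <- 1 / (2 * sqrt(lambda_unit))
e_nnd_large <- 1 / (2 * sqrt(lambda_large))

c(unit = e_nnd_unit, large = e_nnd_large)
```

The events have not moved, but in the larger window the null landscapes are much sparser, so the same events appear to be much closer together than a random process would suggest.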

17.7.2 Edge Effects

As discussed above, definition of the window (region) is critical. If at all possible, the region should be selected in such a way that it is consistent with the underlying process. This is not always possible, either because the underlying process is not known, or because of limitations in data collection capabilities.

When this is the case, it is necessary to define a boundary that does not correspond necessarily with the extent of the process of interest. For example, analysis of business locations in Toronto may be limited to the city limits. This does not mean that establishments do not exist beyond those boundaries. When the extent of the process exceeds the window used in the analysis, the point pattern is observed only partially, and it is possible that the omitted information regarding the location of events beyond the boundary may introduce some bias.

Consider the situation illustrated in the figure below.

Figure: Edge effects

In the figure, the region is the rectangular window. Events are observed only inside the window, but events still exist beyond the edges of the window. It is straightforward to see how the empty space (\(\hat{F}\)-) function would be biased, since locations near the edge would appear to be more distant from an event than they actually are.
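The same logic applies to distances between events. The following sketch in base R illustrates it under stated assumptions: 400 events are simulated over a square that extends beyond the unit-square window, and only the events inside the window are treated as observed. All object names are hypothetical.

```r
set.seed(1)

# Events exist over a larger region than the window
n_all <- 400
x <- runif(n_all, -0.5, 1.5)
y <- runif(n_all, -0.5, 1.5)

# Only the events inside the unit-square window are observed
inside <- x >= 0 & x <= 1 & y >= 0 & y <= 1

# Distances between all pairs of events
d_all <- as.matrix(dist(cbind(x, y)))
diag(d_all) <- Inf # Ignore the distance from an event to itself

# True nearest neighbor distance: the neighbor may be outside the window
nnd_true <- apply(d_all[inside, , drop = FALSE], 1, min)

# Apparent nearest neighbor distance: only observed events are considered
nnd_obs <- apply(d_all[inside, inside, drop = FALSE], 1, min)

# Proportion of observed events whose nearest neighbor distance is overstated
mean(nnd_obs > nnd_true)
```

The apparent distance can never be smaller than the true distance, since the observed events are a subset of all events. The events whose distance is overstated are those whose true nearest neighbor lies beyond the edge of the window.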

Several corrections are available in spatstat to deal with the possibility of edge effects. So far, we have used the argument correction = "none" when applying the functions. The following alternative corrections are implemented: “none”, “rs” (reduced sample, or border), “km” (Kaplan-Meier), “cs” (Chiu-Stoyan) and “best”. Alternatively, correction = "all" selects all options.

These corrections are variations of weighting schemes. In other words, the statistic is weighted to give an unbiased estimator. See:

plot(Gest(pp2.ppp, 
          correction = "all"))

The different corrections are plotted. It can be seen in this case that the corrections are small relative to the uncorrected empirical line; however, this is not always the case.
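The size of a correction can also be inspected numerically. The sketch below is an illustration that uses the `cells` dataset that is distributed with spatstat, not the patterns in this chapter. It assumes that the estimates are stored in columns named `raw` (uncorrected), `rs`, and `km` after converting the result of `Gest()` to a data frame.

```r
library(spatstat)

# Estimate the G-function without correction and with two edge corrections
g_cells <- Gest(cells,
                correction = c("none", "rs", "km"))

# Convert to a data frame to work with the columns directly
g_df <- as.data.frame(g_cells)

# Largest absolute difference between the uncorrected and the Kaplan-Meier estimates
max(abs(g_df$raw - g_df$km), na.rm = TRUE)
```

A small value indicates that, for that pattern and window, the edge correction changes the estimate very little.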

17.7.3 Sampled Point Patterns

Whereas edge effects can introduce bias by censoring the observations outside of the window/region, another issue emerges when not all events are observed inside the window.

We have assumed so far that any point pattern under analysis consists of a census of events, or in other words, that all relevant events have been recorded. A sampled point pattern, on the other hand, is a pattern where not all events have been recorded (see the figure below).

Figure: Sampled point pattern

The bias introduced by sampled point patterns can be extremely serious, because the findings depend heavily on the observations that were recorded as well as those that were not recorded! Clustered events could easily give the impression of a dispersed pattern, depending on what was observed. Imagine for instance that the events are nests of birds. If the birds tend to nest in the thickest parts of the forest that observers cannot easily access, the “observed” pattern will depend crucially on the trails and other routes of access that the researcher can use.
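The distortion can go in either direction. The sketch below, in base R, illustrates the opposite case to the one described above: a random pattern that appears clustered because events are recorded only near a trail. Everything in it is an assumption made for illustration: 400 random events in the unit square, a trail along y = 0.5, observers who record events within 0.05 of the trail, and the hypothetical helper `mean_nnd()`.

```r
set.seed(7)

# All events: a random pattern in the unit square
n_all <- 400
x <- runif(n_all)
y <- runif(n_all)

# Only the events close to the trail are recorded
seen <- abs(y - 0.5) <= 0.05

# Hypothetical helper: mean nearest neighbor distance of a set of events
mean_nnd <- function(x, y) {
  d <- as.matrix(dist(cbind(x, y)))
  diag(d) <- Inf # Ignore the distance from an event to itself
  mean(apply(d, 1, min))
}

# Mean nearest neighbor distance of the recorded events
nnd_seen <- mean_nnd(x[seen], y[seen])

# Expected value if the recorded events were random over the whole unit square
e_nnd_csr <- 1 / (2 * sqrt(sum(seen) / 1))

# A ratio below 1 suggests clustering; a ratio above 1 suggests dispersion
ce_ratio <- nnd_seen / e_nnd_csr
ce_ratio
```

The recorded events are much closer together than expected for a random pattern over the whole region, although the process that generated them is random.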

There are no good solutions to bias introduced by sampled point patterns, and it is not recommended to use the techniques discussed here with sampled point patterns.

This concludes the topic of spatial point patterns.

18 Activity 8: Point Pattern Analysis V

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

18.1 Practice questions

Answer the following questions:

  1. Describe the process to use simulation for hypothesis testing.
  2. Why is the selection of an appropriate region critical for the analysis of point patterns?
  3. Discuss the issues associated with the edges of a region.
  4. What is a sampled point pattern?

18.2 Learning objectives

In this activity, you will:

  1. Explore a dataset using single scale distance-based techniques.
  2. Explore the characteristics of a point pattern at multiple scales.
  3. Discuss ways to evaluate how confident you are that a pattern is random.

18.3 Suggested reading

O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 5. John Wiley & Sons: New Jersey.

18.4 Preliminaries

It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently in the workspace.

Load the libraries you will use in this activity. In addition to tidyverse, you will need spatstat, a package designed for the analysis of point patterns (you can learn about spatstat here and here):

library(isdas)
library(spatstat)
library(tidyverse)

Load a dataset of your choice. It could be one of the datasets that we have used before (Toronto Business Points, Bear GPS Locations), or one of the datasets included with the package spatstat. To see what datasets are available through the package, do the following:

vcdExtra::datasets("spatstat.data")
##                               Item class   dim
## 1                           Kovesi  list 41x13
## 2                         amacrine   ppp     6
## 3                         anemones   ppp     6
## 4                             ants   ppp     6
## 5                ants.extra (ants)  list     7
## 6                         austates  list     4
## 7                          bdspots  list     3
## 8                              bei   ppp     5
## 9                  bei.extra (bei)  list     2
## 10                       betacells   ppp     6
## 11                    bramblecanes   ppp     6
## 12                    bronzefilter   ppp     6
## 13                             btb   ppp     6
## 14                 btb.extra (btb)  list     2
## 15                           cells   ppp     5
## 16                       cetaceans  list   9x4
## 17     cetaceans.extra (cetaceans)  list     1
## 18                         chicago   ppx     3
## 19                         chorley   ppp     6
## 20         chorley.extra (chorley)  list     2
## 21                        clmfires   ppp     6
## 22       clmfires.extra (clmfires)  list     2
## 23                        concrete   ppp     5
## 24                          copper  list     7
## 25                       demohyper  list   3x3
## 26                         demopat   ppp     6
## 27                        dendrite   ppx     3
## 28                        finpines   ppp     6
## 29                             flu  list  41x4
## 30                         ganglia   ppp     6
## 31                          gordon   ppp     5
## 32                        gorillas   ppp     6
## 33       gorillas.extra (gorillas)  list     7
## 34                         hamster   ppp     6
## 35                         heather  list     3
## 36                      humberside   ppp     6
## 37  humberside.convex (humberside)   ppp     6
## 38                        hyytiala   ppp     6
## 39                   japanesepines   ppp     5
## 40                         lansing   ppp     6
## 41                         letterR  owin     5
## 42                        longleaf   ppp     6
## 43                          mucosa   ppp     6
## 44          mucosa.subwin (mucosa)  owin     4
## 45                       murchison  list     3
## 46                         nbfires   ppp     6
## 47         nbfires.extra (nbfires)  list     2
## 48              nbw.rect (nbfires)  owin     4
## 49               nbw.seg (nbfires)  list     5
## 50                         nztrees   ppp     5
## 51                           osteo  list  40x5
## 52                         paracou   ppp     6
## 53                       ponderosa   ppp     5
## 54     ponderosa.extra (ponderosa)  list     2
## 55                       pyramidal  list  31x2
## 56                         redwood   ppp     5
## 57                        redwood3   ppp     5
## 58                     redwoodfull   ppp     5
## 59 redwoodfull.extra (redwoodfull)  list     5
## 60                  residualspaper  list     7
## 61                         shapley   ppp     6
## 62         shapley.extra (shapley)  list     3
## 63                           simba  list  10x2
## 64                          simdat   ppp     5
## 65                       simplenet  list    10
## 66                         spiders   ppx     3
## 67                     sporophores   ppp     6
## 68                         spruces   ppp     6
## 69                      stonetools   ppp     6
## 70                    swedishpines   ppp     5
## 71                         urkiola   ppp     6
## 72                        vesicles   ppp     5
## 73       vesicles.extra (vesicles)  list     4
## 74                            waka   ppp     6
## 75                   waterstriders  list     3
##                                                                                        Title
## 1                                          Colour Sequences with Uniform Perceptual Contrast
## 2                                                                 Hughes' Amacrine Cell Data
## 3                                                                      Beadlet Anemones Data
## 4                                                            Harkness-Isham ants' nests data
## 5                                                            Harkness-Isham ants' nests data
## 6                                                 Australian States and Mainland Territories
## 7                                               Breakdown Spots in Microelectronic Materials
## 8                                                                 Tropical rain forest trees
## 9                                                                 Tropical rain forest trees
## 10                                                         Beta Ganglion Cells in Cat Retina
## 11                                                             Hutchings' Bramble Canes data
## 12                                                               Bronze gradient filter data
## 13                                                                  Bovine Tuberculosis Data
## 14                                                                  Bovine Tuberculosis Data
## 15                                                            Biological Cells Point Pattern
## 16                                            Point patterns of whale and dolphin sightings.
## 17                                            Point patterns of whale and dolphin sightings.
## 18                                                                        Chicago Crime Data
## 19                                                                Chorley-Ribble Cancer Data
## 20                                                                Chorley-Ribble Cancer Data
## 21                                                           Castilla-La Mancha Forest Fires
## 22                                                           Castilla-La Mancha Forest Fires
## 23                                                                   Air Bubbles in Concrete
## 24                                                   Berman-Huntington points and lines data
## 25                                       Demonstration Example of Hyperframe of Spatial Data
## 26                                                             Artificial Data Point Pattern
## 27                                                                     Dendritic Spines Data
## 28                                                                 Pine saplings in Finland.
## 29                                                                  Influenza Virus Proteins
## 30                                            Beta Ganglion Cells in Cat Retina, Old Version
## 31                                                                   People in Gordon Square
## 32                                                                     Gorilla Nesting Sites
## 33                                                                     Gorilla Nesting Sites
## 34                                                              Aherne's hamster tumour data
## 35                                                                     Diggle's Heather Data
## 36                                       Humberside Data on Childhood Leukaemia and Lymphoma
## 37                                       Humberside Data on Childhood Leukaemia and Lymphoma
## 38                                                   Scots pines and other trees at Hyytiala
## 39                                                              Japanese Pines Point Pattern
## 40                                                               Lansing Woods Point Pattern
## 41                                                               Window in Shape of Letter R
## 42                                                              Longleaf Pines Point Pattern
## 43                                                                   Cells in Gastric Mucosa
## 44                                                                   Cells in Gastric Mucosa
## 45                                                                   Murchison gold deposits
## 46                                              Point Patterns of New Brunswick Forest Fires
## 47                                              Point Patterns of New Brunswick Forest Fires
## 48                                              Point Patterns of New Brunswick Forest Fires
## 49                                              Point Patterns of New Brunswick Forest Fires
## 50                                                           New Zealand Trees Point Pattern
## 51                       Osteocyte Lacunae Data: Replicated Three-Dimensional Point Patterns
## 52                                                   Kimboto trees at Paracou, French Guiana
## 53                                                         Ponderosa Pine Tree Point Pattern
## 54                                                         Ponderosa Pine Tree Point Pattern
## 55                                                     Pyramidal Neurons in Cingulate Cortex
## 56                                       California Redwoods Point Pattern (Ripley's Subset)
## 57                                       California Redwoods Point Pattern (Ripley's Subset)
## 58                                        California Redwoods Point Pattern (Entire Dataset)
## 59                                        California Redwoods Point Pattern (Entire Dataset)
## 60                                     Data and Code From JRSS Discussion Paper on Residuals
## 61                                                      Galaxies in the Shapley Supercluster
## 62                                                      Galaxies in the Shapley Supercluster
## 63            Simulated data from a two-group experiment with replication within each group.
## 64                                                                   Simulated Point Pattern
## 65                                                          Simple Example of Linear Network
## 66                                               Spider Webs on Mortar Lines of a Brick Wall
## 67                                                                          Sporophores Data
## 68                                                                     Spruces Point Pattern
## 69                                                                  Palaeolithic Stone Tools
## 70                                                               Swedish Pines Point Pattern
## 71                                                               Urkiola Woods Point Pattern
## 72                                                                             Vesicles Data
## 73                                                                             Vesicles Data
## 74                                                               Trees in Waka national park
## 75 Waterstriders data.  Three independent replications of a point pattern formed by insects.

You can do this by using the load() function if the dataset is on your drive (e.g., the GPS coordinates of the bear).

On the other hand, if the dataset is included with the spatstat package you can do the following, for example to load the gorillas dataset:

gorillas.ppp <- gorillas

As usual, you can check the object by means of the summary function:

summary(gorillas.ppp)
## Marked planar point pattern:  647 points
## Average intensity 3.255566e-05 points per square metre
## 
## *Pattern contains duplicated points*
## 
## Coordinates are given to 2 decimal places
## i.e. rounded to the nearest multiple of 0.01 metres
## 
## Mark variables: group, season, date
## Summary:
##     group              season               date           
##  Length:647         Length:647         Min.   :2006-01-06  
##  Class :character   Class :character   1st Qu.:2007-03-15  
##  Mode  :character   Mode  :character   Median :2008-02-05  
##                                        Mean   :2007-12-14  
##                                        3rd Qu.:2008-09-23  
##                                        Max.   :2009-05-31  
## 
## Window: polygonal boundary
## single connected closed polygon with 21 vertices
## enclosing rectangle: [580457.9, 585934] x [674172.8, 678739.2] metres
##                      (5476 x 4566 metres)
## Window area = 19873700 square metres
## Unit of length: 1 metre
## Fraction of frame area: 0.795

18.5 Activity

Capstone Activity

This is a capstone activity where you can work free-style on a dataset of your choice, and put into practice what you have learned with respect to the analysis of point patterns.

  1. Partner with a fellow student to analyze the chosen dataset.

  2. Discuss whether the pattern is random, and how confident you are in your decision.

  3. The analysis of the pattern is meant to provide insights about the underlying process. Formulate a hypothesis about the process based on the data. Can you evaluate that hypothesis using the plots that you generated?

  4. Discuss the limitations of the analysis, for instance, the choice of modeling parameters (size of region, kernel bandwidths, edge effects, etc.).

Part IV: Data in Areal Units

19 Area Data I

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

19.1 Learning Objectives

In the last few practices/sessions, you learned about spatial point patterns. The next few sessions will concentrate on area data.

In this practice, you will learn:

  1. A formal definition of area data.
  2. Processes and area data.
  3. Visualizing area data: Choropleth maps.
  4. Visualizing area data: Cartograms.

19.2 Suggested Readings

  • Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapter 7. Longman: Essex.
  • Bivand RS, Pebesma E, and Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 9. Springer: New York.
  • Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 7. Sage: Los Angeles.
  • O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 7. John Wiley & Sons: New Jersey.

19.3 Preliminaries

As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently in the workspace.

Load the libraries you will use in this activity:

library(cartogram)
library(isdas)
library(gridExtra)
library(plotly)
library(sf)
library(tidyverse)

Read the data used in this chapter.

data("Hamilton_CT")

The data are an object of class sf that includes the spatial information for the census tracts in the Hamilton Census Metropolitan Area in Canada and a series of demographic variables from the 2011 Census of Canada.

You can quickly verify the contents of the dataframe by means of summary:

summary(Hamilton_CT)
##        ID               AREA             TRACT             POPULATION     POP_DENSITY         AGE_LESS_20    
##  Min.   : 919807   Min.   :  0.3154   Length:188         Min.   :    5   Min.   :    2.591   Min.   :   0.0  
##  1st Qu.: 927964   1st Qu.:  0.8552   Class :character   1st Qu.: 2639   1st Qu.: 1438.007   1st Qu.: 528.8  
##  Median : 948130   Median :  1.4157   Mode  :character   Median : 3595   Median : 2689.737   Median : 750.0  
##  Mean   : 948710   Mean   :  7.4578                      Mean   : 3835   Mean   : 2853.078   Mean   : 899.3  
##  3rd Qu.: 959722   3rd Qu.:  2.7775                      3rd Qu.: 4692   3rd Qu.: 3783.889   3rd Qu.:1110.0  
##  Max.   :1115750   Max.   :138.4466                      Max.   :11675   Max.   :14234.286   Max.   :3285.0  
##   AGE_20_TO_24    AGE_25_TO_29    AGE_30_TO_34     AGE_35_TO_39     AGE_40_TO_44     AGE_45_TO_49    AGE_50_TO_54  
##  Min.   :  0.0   Min.   :  0.0   Min.   :   0.0   Min.   :   0.0   Min.   :   0.0   Min.   :  0.0   Min.   :  0.0  
##  1st Qu.:168.8   1st Qu.:135.0   1st Qu.: 135.0   1st Qu.: 145.0   1st Qu.: 170.0   1st Qu.:203.8   1st Qu.:203.8  
##  Median :225.0   Median :215.0   Median : 195.0   Median : 200.0   Median : 230.0   Median :282.5   Median :280.0  
##  Mean   :253.9   Mean   :232.8   Mean   : 228.2   Mean   : 239.6   Mean   : 268.7   Mean   :310.6   Mean   :300.3  
##  3rd Qu.:311.2   3rd Qu.:296.2   3rd Qu.: 281.2   3rd Qu.: 280.0   3rd Qu.: 325.0   3rd Qu.:385.0   3rd Qu.:375.0  
##  Max.   :835.0   Max.   :915.0   Max.   :1320.0   Max.   :1200.0   Max.   :1105.0   Max.   :880.0   Max.   :740.0  
##   AGE_55_TO_59    AGE_60_TO_64  AGE_65_TO_69    AGE_70_TO_74    AGE_75_TO_79     AGE_80_TO_84     AGE_MORE_85    
##  Min.   :  0.0   Min.   :  0   Min.   :  0.0   Min.   :  0.0   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:175.0   1st Qu.:140   1st Qu.:115.0   1st Qu.: 90.0   1st Qu.: 68.75   1st Qu.: 50.00   1st Qu.: 35.00  
##  Median :240.0   Median :220   Median :157.5   Median :130.0   Median :100.00   Median : 77.50   Median : 70.00  
##  Mean   :257.7   Mean   :229   Mean   :174.2   Mean   :139.7   Mean   :118.32   Mean   : 95.05   Mean   : 87.71  
##  3rd Qu.:325.0   3rd Qu.:295   3rd Qu.:221.2   3rd Qu.:180.0   3rd Qu.:160.00   3rd Qu.:120.00   3rd Qu.:105.00  
##  Max.   :625.0   Max.   :540   Max.   :625.0   Max.   :540.0   Max.   :575.00   Max.   :420.00   Max.   :400.00  
##           geometry  
##  POLYGON      :188  
##  epsg:26917   :  0  
##  +proj=utm ...:  0  
##                     
##                     
## 

19.4 Area Data

Every phenomenon can be measured at a location (ask yourself, what exists outside of space?).

In point pattern analysis, the unit of support is the point, and the source of randomness is the location itself. Many other forms of data are also collected at points. For instance, when the census collects information on population, at its most basic, the information can be georeferenced to an address, that is, a point.

In numerous applications, however, data are not reported at their fundamental unit of support, but rather are aggregated to some other geometry, for instance an area. This is done for several reasons, including the privacy and confidentiality of the data. Instead of reporting individual-level information, the information is reported for zoning systems that often are devised without consideration to any underlying social, natural, or economic processes.

Census data, for example, are reported at different levels of geography. In Canada, the smallest publicly available geography is called a Dissemination Area or DA. A DA in Canada contains a population between 400 and 700 persons. Thus, instead of reporting that one person (or more) is located at a point (i.e., an address), the census reports the population for the DA. Other data are aggregated in similar ways (income, residential status, etc.).

At the highest level of aggregation, national level statistics are reported, such as Gross Domestic Product, or GDP. Economic production is not evenly distributed across space; however, the national GDP does not distinguish regional variations in this process.

Ideally, a data analyst would work with data in its most fundamental support. This is not always possible, and therefore many techniques have been developed to work with data that have been aggregated to zones.

When working with areas, it is less practical to identify the area with the coordinates (as we did with points). After all, areas will be composed of lines and reporting all the relevant coordinates is impractical. Sometimes the geometric centroids of the areas are used instead.

More commonly, areas are assigned an index or unique identifier, so that a region will typically consist of a set of \(n\) areas as follows: \[ R = A_1 \cup A_2 \cup A_3 \cup ...\cup A_n. \]

The above is read as “the Region R is the union of Areas 1 to n”.

Each area \(A_i\) in the region can have a set of \(k\) attributes or variables associated with it, for instance: \[ \textbf{X}_i=[x_{i1}, x_{i2}, x_{i3},...,x_{ik}] \]

These attributes will typically be counts (e.g., number of people in a DA), or some summary measure of the underlying data (e.g., mean commute time).

19.5 Processes and Area Data

Imagine that data on income by household were collected as follows:

# Here, we are creating a dataframe with three columns, coordinates x and y in space to indicate the locations of households, and their income.
df <- data.frame(x = c(0.3, 0.4, 0.5, 0.6, 0.7), y = c(0.1, 0.4, 0.2, 0.5, 0.3), Income = c(30000, 30000, 100000, 100000, 100000))

Households are geocoded as points with coordinates x and y, whereas income is in dollars.

Plot the income as points (hover over the points to see the attributes):

# The `ggplot()` function is used to create a plot. The function `geom_point()` adds points to the plot, using the values of coordinates x and y, and coloring by Income. Higher income households appear to be on the East regions of the area.

p <- ggplot(data = df, aes(x = x, y = y, color = Income)) + 
  geom_point(shape = 17, size = 5) +
  coord_fixed()
ggplotly(p)

The underlying process is one of income sorting, with lower incomes to the west, and higher incomes to the east. This could be due to a geographical feature of the landscape (for instance, an escarpment), or the distribution of the housing stock (with a neighborhood that has more expensive houses). These are examples of a variable that responds to a common environmental factor. As an alternative, people may display a preference towards being near others that are similar to them (this is called homophily). When this happens, the variable responds to itself in space.

The quality of similarity or dissimilarity between neighboring observations of the same variable in space is called spatial autocorrelation. You will learn more about this later on.

Another reason why variables reported for areas could display similarities in space is as a consequence of the zoning system.

Suppose for a moment that the data above can only be reported at the zonal level, perhaps because of privacy and confidentiality concerns. Thanks to the great talent of the designers of the zoning system (or a felicitous coincidence!), the zoning system is such that it is consistent with the underlying process of sorting. The zones, therefore, are as follows:

# Here, we create a new dataframe with the coordinates necessary to define two zones. The zones are rectangles, so we need to define four corners for each. "Zone_ID" only has 2 values because there are only two zones in the analysis. 

zones1 <- data.frame(x1=c(0.2, 0.45), x2=c(0.45, 0.80), y1=c(0.0, 0.0), y2=c(0.6, 0.6), Zone_ID = c('1','2'))

If you add these zones to the plot:

# Similar to the plot above, but adding the zones with `geom_rect()` for plotting rectangles.
p <- ggplot() + 
  geom_rect(data = zones1, mapping = aes(xmin = x1, xmax = x2, ymin = y1, ymax = y2, fill = Zone_ID), alpha = 0.3) + 
  geom_point(data = df, aes(x = x, y = y, color = Income), shape = 17, size = 5) +
  coord_fixed()
ggplotly(p)

What is the mean income in zone 1? What is the mean income in zone 2? Not only are the summary measures of income highly representative of the observations they describe, the two zones are also highly distinct.

Imagine now that for whatever reason (lack of prior knowledge of the process, convenience for data collection, etc.) the zones instead are as follows:

# Note how the values have changed for x1 and x2. This reveals that the zones have shifted and are no longer the same as the plot above. 

zones2 <- data.frame(x1=c(0.2, 0.55), x2=c(0.55, 0.80), y1=c(0.0, 0.0), y2=c(0.6, 0.6), Zone_ID = c('1','2'))

If you plot these zones:

p <- ggplot() + 
  geom_rect(data = zones2, mapping = aes(xmin = x1, xmax = x2, ymin = y1, ymax = y2, fill = Zone_ID), alpha = 0.3) + 
  geom_point(data = df, aes(x = x, y = y, color = Income), shape = 17, size = 5) +
  coord_fixed()
ggplotly(p)

What is now the mean income of zone 1? What is the mean income of zone 2? The observations have not changed, and the generating spatial process remains the same. However, as you can see, the summary measures for the two zones are more similar in this case than they were when the zones more closely captured the underlying process.
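The questions above can be answered with a short calculation. The following is a minimal sketch in base R, assuming that the two zones in each system are separated by a single vertical boundary (at x = 0.45 in `zones1` and at x = 0.55 in `zones2`). The helper `mean_income_by_zone()` is hypothetical and is introduced here only for illustration:

```r
# Household data, as defined above
df <- data.frame(x = c(0.3, 0.4, 0.5, 0.6, 0.7),
                 y = c(0.1, 0.4, 0.2, 0.5, 0.3),
                 Income = c(30000, 30000, 100000, 100000, 100000))

# Hypothetical helper: assign each household to zone "1" (west of the
# boundary) or zone "2" (east of the boundary), then average income by zone
mean_income_by_zone <- function(df, boundary) {
  zone <- ifelse(df$x < boundary, "1", "2")
  tapply(df$Income, zone, mean)
}

# Zoning system 1: boundary at x = 0.45
means1 <- mean_income_by_zone(df, boundary = 0.45)

# Zoning system 2: boundary at x = 0.55
means2 <- mean_income_by_zone(df, boundary = 0.55)

means1
means2
```

With the first zoning system the zonal means are 30,000 and 100,000. With the second they are approximately 53,333 and 100,000, because one high income household is now averaged with the two low income households in zone 1.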

19.6 Visualizing Area Data: Choropleth Maps

The very first step when working with spatial area data, perhaps, is to visualize the data.

Commonly, area data are visualized by means of choropleth maps. A choropleth map is a map of the polygons that form the areas in the region, each colored in a way to represent the value of an underlying variable.

Let's use ggplot2 to create a choropleth map of population in Hamilton. Notice that the fill color for the polygons is given by cutting the values of POPULATION into five groups with equal numbers of zones (quintiles). In other words, the colors represent zones in the bottom 20% of population, zones in the next 20%, and so on, so that the darkest zones are those with populations so large as to be in the top 20% of the population distribution:

# Geographical information can also be plotted using `ggplot2` when it is in the form of simple features or `sf`. Here, we create a plot with function `ggplot()`. We also have available the census tracts for Hamilton in an `sf` dataframe. To plot the distribution of the population in five equal segments (or quintiles), we apply the function `cut_number()` to the variable `POPULATION` from the `Hamilton_CT` census tract dataframe. The aesthetic value for `fill` will color the zones according to the population quintiles.  

ggplot(Hamilton_CT) + 
  geom_sf(aes(fill = cut_number(POPULATION, 5)), color = NA, size = 0.1) +
  scale_fill_brewer(palette = "YlOrRd") +
  coord_sf() +
  labs(fill = "Population")

Inspect the map above. Would you say that the distribution of population is random, or not random? If not random, what do you think might be an underlying process for the distribution of population?

Often, creating a choropleth map using the absolute value of a variable can be somewhat misleading. As illustrated by the map of population by census tract in Hamilton, the zones with the largest population are also often large zones. Many processes are confounded by the size of the zones: quite simply, in larger areas often there is more of, well, almost anything, compared with smaller areas. For this reason, it is often more informative when creating a choropleth map to use a variable that is a rate. Rates are quantities that are measured with respect to something. For instance, population divided by area, or population density, is a rate:

# Note how `cut_number()` is applied to population density rather than population as in the figure above. This gives a different, and perhaps more informative, view of the distribution of population, by measuring population against area.

pop_den.map <- ggplot(Hamilton_CT) + 
  geom_sf(aes(fill = cut_number(POP_DENSITY, 5)), color = "white", size = 0.1) +
  scale_fill_brewer(palette = "YlOrRd") +
  labs(fill = "Pop Density")
pop_den.map

It can be seen now that the population density is higher in the more central parts of Hamilton, Burlington, Dundas, etc. Does the map look random? If not, what might be an underlying process that explains the variations in population density in a city like Hamilton?
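The effect of zone size on counts can be illustrated with a small example. This is only a sketch: the three zones, their populations, and their areas below are made-up values, not taken from `Hamilton_CT`:

```r
# Hypothetical zones (values invented for illustration only)
zones <- data.frame(zone = c("A", "B", "C"),
                    population = c(5000, 3000, 1000),
                    area_km2 = c(100, 2, 0.5),
                    stringsAsFactors = FALSE)

# A rate: population measured with respect to area
zones$pop_density <- zones$population / zones$area_km2

# Ranking of the zones from highest to lowest, by count and by rate
rank_by_count <- zones$zone[order(-zones$population)]
rank_by_density <- zones$zone[order(-zones$pop_density)]

rank_by_count
rank_by_density
```

In this example the large zone A has the most people but the lowest density, so the ranking of the zones is reversed when the rate is used instead of the count.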

Other times, it is appropriate to standardize not by area, but by what might be called the population at risk. For instance, imagine that we wanted to explore the distribution of the population of older adults (say, 65 and older). In this case, normalizing by the total population instead of by area removes the “size” effect and gives a rate:

# The `Hamilton_CT` dataframe reports population by age group. For this choropleth map, we sum all age groups 65 and older, and then divide by total population. This measures the population of older adults against total population, to give a proportion (the rate out of a total).

ggplot(Hamilton_CT) + 
  geom_sf(aes(fill = cut_number((AGE_65_TO_69 +
                                 AGE_70_TO_74 +
                                 AGE_75_TO_79 +
                                 AGE_80_TO_84 +
                                 AGE_MORE_85) / POPULATION, 5)),
          color = NA, 
          size = 0.1) +
  scale_fill_brewer(palette = "YlOrRd") +
  labs(fill = "Prop Age 65+")

Do you notice a pattern in the distribution of seniors in the Hamilton CMA?

There are a few things to keep in mind when creating choropleth maps.

First, what classification scheme to use, with how many classes, and what colors?

The examples above were all created using a classification scheme based on the quintiles of the distribution. As noted above, these are obtained by dividing the sample into 5 equal parts to give bottom 20%, etc., of observations. The quintiles are a particular form of a statistical summary measure known as quantiles. Another example of a quantile is the median, which is the value obtained when the sample is divided in two equal sized parts. Other classification schemes may include the mean, standard deviations, and so on. Essentially, a classification scheme defines a way to divide the sample for representation in a choropleth map.
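A classification by quintiles can be sketched in base R with `quantile()` and `cut()`. This illustrates the idea of dividing a sample into five equal parts; it is not necessarily how `cut_number()` is implemented internally, and the values of `x` are hypothetical:

```r
# Hypothetical values with a skewed distribution (not taken from Hamilton_CT)
x <- (1:100)^2

# Breaks at the 0%, 20%, 40%, 60%, 80%, and 100% quantiles
breaks <- quantile(x, probs = seq(0, 1, by = 0.2))

# Assign every observation to one of the five classes
classes <- cut(x, breaks = breaks, include.lowest = TRUE)

# Each class contains the same number of observations
class_counts <- table(classes)
class_counts

# The median is the quantile that divides the sample in two equal parts
median_x <- unname(quantile(x, probs = 0.5))
```

Although the classes have very different widths (the values are skewed), each class contains 20 of the 100 observations, which is what makes this a quantile-based scheme.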

In terms of how many classes to use, often there is little point in using more than six or seven classes, because the human eye cannot distinguish color differences at a much higher resolution.

The colors are a matter of style and preference, but there are coloring schemes that are colorblind safe (see here). Also, for communication purposes, there are conventions that assign values or meanings to colors. Maps showing results of elections often use the colors of political parties: this is such a widespread convention that it would be thoroughly confusing if the colors were reversed, more so than if just the colors were exchanged for others. Red is often associated with heat, concentration, or sometimes bad, whereas green is associated with good. Here is an interesting discussion of use of colors in visualization.

Secondly, when the zoning system is irregular (as opposed to, say, a raster, which is composed of pixels, regular tiles of consistent size), large zones can easily become dominant. In effect, much detail in the maps above is lost for small zones, whereas large zones, especially if similarly colored, may mislead the eye as to their relative frequency.

Another mapping technique, the cartogram, is meant to reduce the issues with small-large zones.

19.7 Visualizing Area Data: Cartograms

A cartogram is a map where the size of the zones is adjusted so that instead of being the surface area, it is proportional to some other variable of interest.

We will illustrate the idea behind the cartogram here.

In the maps that we created above, the zones are faithful to their geographical properties (subject to distortions due to geographical projection). Unfortunately, this feature of the maps obscured the relevance of some of the smaller zones. A cartogram can be weighted by another variable, say for instance, the population. In this way, the size of the zones will depend on the total population.

Cartograms are implemented in R in the package cartogram.

# The function `cartogram_cont()` constructs a continuous area cartogram. Here, a cartogram is created for census tracts of the city of Hamilton, but the size of the zones will be weighted by the variable `POPULATION`.
CT_pop_cartogram <- cartogram_cont(Hamilton_CT, weight = "POPULATION")
## Mean size error for iteration 1: 5.93989832705674
## Mean size error for iteration 2: 4.5514055520835
## Mean size error for iteration 3: 7.74856106866916
## Mean size error for iteration 4: 7.49510294164283
## Mean size error for iteration 5: 5.12121781701006
## Mean size error for iteration 6: 3.45188989405368
## Mean size error for iteration 7: 2.66683855570118
## Mean size error for iteration 8: 2.23950467189881
## Mean size error for iteration 9: 1.93816581350794
## Mean size error for iteration 10: 1.78377894897916
## Mean size error for iteration 11: 1.62985317085302
## Mean size error for iteration 12: 1.50983288572639
## Mean size error for iteration 13: 1.60808238152904
## Mean size error for iteration 14: 6.67220825006972
## Mean size error for iteration 15: 8.78821301683394

Plotting the cartogram:

# We are using `ggplot()` to plot the cartogram of population by census tract in Hamilton. The size of each census tract is distorted to represent its population. The number 5 in `cut_number()` states that the population values are divided into 5 classes.
ggplot(CT_pop_cartogram) + 
  geom_sf(aes(fill = cut_number(POPULATION, 5)), color = "white", size = 0.1) +
  scale_fill_brewer(palette = "YlOrRd") +
  labs(fill = "Population")

Notice how the size of the zones has been adjusted.

The cartogram can be combined with coloring schemes, as in choropleth maps:

CT_popden_cartogram <- cartogram_cont(Hamilton_CT, weight = "POP_DENSITY")
## Mean size error for iteration 1: 29.0384287070147
## Mean size error for iteration 2: 26.6652279985395
## Mean size error for iteration 3: 24.8111000080233
## Mean size error for iteration 4: 23.2716548947531
## Mean size error for iteration 5: 21.928598879704
## Mean size error for iteration 6: 20.7113138849207
## Mean size error for iteration 7: 19.576698518681
## Mean size error for iteration 8: 18.4983401508171
## Mean size error for iteration 9: 17.460238779898
## Mean size error for iteration 10: 16.453534698246
## Mean size error for iteration 11: 15.4732800316789
## Mean size error for iteration 12: 14.5184813061204
## Mean size error for iteration 13: 13.5901475440423
## Mean size error for iteration 14: 12.6911089325245
## Mean size error for iteration 15: 11.8246511070686

Plot the cartogram:

pop_den.cartogram <- ggplot(CT_popden_cartogram) + 
  geom_sf(aes(fill = cut_number(POP_DENSITY, 5)),color = "white", size = 0.1) +
  scale_fill_brewer(palette = "YlOrRd") +
  labs(fill = "Pop Density")
pop_den.cartogram

By combining a cartogram with choropleth mapping, it becomes easier to appreciate the way high population density is concentrated in the central parts of Hamilton, Burlington, etc.

grid.arrange(pop_den.map, pop_den.cartogram, nrow = 2)

This concludes this chapter.

20 Activity 9: Area Data I

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

20.1 Practice questions

Answer the following questions:

  1. What is a key difference between area data and point data?
  2. What is a choropleth map?
  3. What is a cartogram?
  4. What are the advantages and disadvantages of these mapping techniques?

20.2 Learning objectives

In this activity, you will:

  1. Create choropleth maps using census data.
  2. Think about possible underlying processes that could explain the pattern.
  3. Think about ways to decide whether a landscape is random when working with area data.

20.3 Suggested reading

O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 7. John Wiley & Sons: New Jersey.

20.4 Preliminaries

It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently on the workspace.

Load the libraries you will use in this activity.

In addition to tidyverse, you will need sf, a package that implements simple features in R (you can learn more about this package here):

library(tidyverse)
library(sf)
library(cartogram)
library(isdas)

In the practice that preceded this activity, you learned about area data and about visualization techniques for this type of data.

Begin by loading the data that you will use in this activity:

data("Hamilton_CT")

This is an sf object with census tracts and selected demographic variables for the Hamilton CMA in Canada.

You can obtain new (calculated) variables as follows. For instance, to obtain the proportion of residents who are between 20 and 34 years old, and between 35 and 49:

Hamilton_CT <- Hamilton_CT %>%
  mutate(Prop20to34 = (AGE_20_TO_24 + 
                         AGE_25_TO_29 + 
                         AGE_30_TO_34)/POPULATION, 
         Prop35to49 = (AGE_35_TO_39 + 
                         AGE_40_TO_44 + 
                         AGE_45_TO_49)/POPULATION)

You are ready for the next activity.

20.5 Activity

NOTE: Activities include technical “how to” tasks/questions. Usually, these ask you to practice using the software to organize data, create plots, and so on in support of analysis and interpretation. The second type of questions ask you to activate your brainware and to think geographically and statistically.

Activity Part I

  1. Create choropleth maps for the proportion of the population who are 20 to 34 years old, 35 to 49 years old, 50 to 64 years old, and 65 and older.

  2. Create cartograms for the proportion of the population who are 20 to 34 years old, 35 to 49 years old, 50 to 64 years old, and 65 and older.

  3. Change the scheme and colors of your maps to obtain maps with 2 classes/colors, 5 classes/colors, and 10 classes/colors. You can check different color palettes in the documentation of {ggplot2}. Which scheme is more informative? What colors looked better to you?

Activity Part II

  1. Show your maps to a fellow student. What patterns do you notice in the distribution of population by age in Hamilton? Do you think the distribution of the population by age is random, or not random?

  2. Devise a rule to decide whether the pattern observed in a choropleth map is random.

21 Area Data II

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

21.1 Learning Objectives

In the last chapter and activity, you learned about area data and practiced some visualization techniques for spatial data of this type, specifically choropleth maps and cartograms. You also thought about rules to decide whether a mapped variable displayed a spatially random distribution of values.

In this practice, you will learn about:

  1. The concept of proximity for area data.
  2. How to formalize the concept of proximity: spatial weights matrices.
  3. How to create spatial weights matrices in R.
  4. The use of spatial moving averages.
  5. Other criteria for coding proximity.

21.2 Suggested Readings

  • Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapter 7. Longman: Essex.
  • Bivand RS, Pebesma E, and Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 9. Springer: New York.
  • Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 7. Sage: Los Angeles.
  • O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 7. John Wiley & Sons: New Jersey.

21.3 Preliminaries

As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently on the workspace.

Load the libraries you will use in this activity:

library(isdas)
library(plotly)
library(sf)
library(spdep)
library(tidyverse)

Read the data to be used in this chapter. The data is an object of class sf (simple feature) with the census tracts of Hamilton CMA in Canada, and a selection of demographic variables:

data(Hamilton_CT)

You can quickly verify the contents of the dataframe by means of summary:

summary(Hamilton_CT)
##        ID               AREA             TRACT             POPULATION     POP_DENSITY         AGE_LESS_20    
##  Min.   : 919807   Min.   :  0.3154   Length:188         Min.   :    5   Min.   :    2.591   Min.   :   0.0  
##  1st Qu.: 927964   1st Qu.:  0.8552   Class :character   1st Qu.: 2639   1st Qu.: 1438.007   1st Qu.: 528.8  
##  Median : 948130   Median :  1.4157   Mode  :character   Median : 3595   Median : 2689.737   Median : 750.0  
##  Mean   : 948710   Mean   :  7.4578                      Mean   : 3835   Mean   : 2853.078   Mean   : 899.3  
##  3rd Qu.: 959722   3rd Qu.:  2.7775                      3rd Qu.: 4692   3rd Qu.: 3783.889   3rd Qu.:1110.0  
##  Max.   :1115750   Max.   :138.4466                      Max.   :11675   Max.   :14234.286   Max.   :3285.0  
##   AGE_20_TO_24    AGE_25_TO_29    AGE_30_TO_34     AGE_35_TO_39     AGE_40_TO_44     AGE_45_TO_49    AGE_50_TO_54  
##  Min.   :  0.0   Min.   :  0.0   Min.   :   0.0   Min.   :   0.0   Min.   :   0.0   Min.   :  0.0   Min.   :  0.0  
##  1st Qu.:168.8   1st Qu.:135.0   1st Qu.: 135.0   1st Qu.: 145.0   1st Qu.: 170.0   1st Qu.:203.8   1st Qu.:203.8  
##  Median :225.0   Median :215.0   Median : 195.0   Median : 200.0   Median : 230.0   Median :282.5   Median :280.0  
##  Mean   :253.9   Mean   :232.8   Mean   : 228.2   Mean   : 239.6   Mean   : 268.7   Mean   :310.6   Mean   :300.3  
##  3rd Qu.:311.2   3rd Qu.:296.2   3rd Qu.: 281.2   3rd Qu.: 280.0   3rd Qu.: 325.0   3rd Qu.:385.0   3rd Qu.:375.0  
##  Max.   :835.0   Max.   :915.0   Max.   :1320.0   Max.   :1200.0   Max.   :1105.0   Max.   :880.0   Max.   :740.0  
##   AGE_55_TO_59    AGE_60_TO_64  AGE_65_TO_69    AGE_70_TO_74    AGE_75_TO_79     AGE_80_TO_84     AGE_MORE_85    
##  Min.   :  0.0   Min.   :  0   Min.   :  0.0   Min.   :  0.0   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:175.0   1st Qu.:140   1st Qu.:115.0   1st Qu.: 90.0   1st Qu.: 68.75   1st Qu.: 50.00   1st Qu.: 35.00  
##  Median :240.0   Median :220   Median :157.5   Median :130.0   Median :100.00   Median : 77.50   Median : 70.00  
##  Mean   :257.7   Mean   :229   Mean   :174.2   Mean   :139.7   Mean   :118.32   Mean   : 95.05   Mean   : 87.71  
##  3rd Qu.:325.0   3rd Qu.:295   3rd Qu.:221.2   3rd Qu.:180.0   3rd Qu.:160.00   3rd Qu.:120.00   3rd Qu.:105.00  
##  Max.   :625.0   Max.   :540   Max.   :625.0   Max.   :540.0   Max.   :575.00   Max.   :420.00   Max.   :400.00  
##           geometry  
##  POLYGON      :188  
##  epsg:26917   :  0  
##  +proj=utm ...:  0  
##                     
##                     
## 

21.4 Proximity in Area Data

In the earlier part of the text, when working with point data, the spatial relationships among events (their proximity) were more or less unambiguously given by their relative location, or more precisely by their distance. Hence, we had quadrat-based techniques (relative location with respect to a grid), kernel density (relative location with respect to the center of a kernel function), and distance-based techniques (event-to-event and point-to-event distances).

In the case of area data, spatial proximity can be represented in more ways, given the characteristics of areas. In particular, an area contains an infinite number of points, and measuring distance between two areas leads to an infinite number of results, depending on which pairs of points within two zones are used to measure the distance.

Consider the simple zonal system shown in Figure 1. Which of zones \(A_2\), \(A_3\), and \(A_4\) is closer (or more proximate) to \(A_1\)?

Figure 1: Simple zoning system

We can devise a way of establishing proximity between areas as follows: if points are selected in such a way that they are on the overlapping edges of two contiguous areas, the distance between these two areas clearly is zero, and they must be proximate.

This criterion to define proximity is called adjacency. Adjacency means that two zones share a common edge. This is conventionally called the rook criterion, after chess, in which the piece called the rook can move only orthogonally (in the vertical and horizontal directions). The rook criterion, however, would dictate that zones \(A_2\) and \(A_6\) are not proximate, despite being closer than \(A_2\) and \(A_3\).

When this criterion is expanded to allow contact at a single point between zones (say, the corner between \(A_2\) and \(A_6\)), the adjacency criterion is called queen, again, for the chess piece that moves both orthogonally and diagonally.

If we accept adjacency as a reasonable way of expressing relationships of proximity between areas, what we need is a way of coding relationships of adjacency in a way that is convenient and amenable to manipulation for data analysis.

One of the most widely used tools to code proximity in area data is the spatial weights matrix.

21.5 Spatial Weights Matrices

A spatial weights matrix is an arrangement of values (or weights) for all pairs of zones in a system. For instance, in a zoning system such as shown in Figure 1, with 6 zones, there will be \(6 \times 6\) such weights. The weights are organized by rows, in such a way that each zone has a corresponding row of weights. For example, zone \(A_1\) in Figure 1 has the following weights, one for each zone in the system: \[ w_{1\cdot} = [w_{11}, w_{12}, w_{13}, w_{14}, w_{15}, w_{16}] \]

The values of the weights depend on the adjacency criterion adopted. The simplest coding scheme is when we assign a value of 1 to pairs of zones that are adjacent, and a value of 0 to pairs of zones that are not.

Let's formalize the two criteria mentioned above:

  • Rook criterion

\[ w_{ij}=\bigg\{\begin{array}{l l} 1\text{ if } A_i \text{ and } A_j \text{ share an edge}\\ 0\text{ otherwise}\\ \end{array} \]

If rook adjacency is used, the weights for zone \(A_6\) are as follows: \[ w_{6\cdot} = [0, 0, 0, 1, 1, 0]. \]

As you can see, the adjacent areas from the perspective of \(A_6\) are \(A_4\) and \(A_5\) by virtue of sharing an edge. These two areas receive weights of 1. On the other hand, \(A_1\), \(A_2\), and \(A_3\) are not adjacent, and therefore receive a weight of zero. Notice how the weight \(w_{66}\) is set to zero. By convention, an area is not its own neighbor!

  • Queen criterion

\[ w_{ij}=\bigg\{\begin{array}{l l} 1\text{ if } A_i \text{ and } A_j \text{ share an edge or a vertex}\\ 0\text{ otherwise}\\ \end{array} \]

If queen adjacency is used, the weights for zone \(A_6\) are as follows: \[ w_{6\cdot} = [0, 1, 0, 1, 1, 0]. \]

As you can see, the adjacent areas from the perspective of \(A_6\) are \(A_4\) and \(A_5\) (by virtue of sharing an edge), and \(A_2\) (by virtue of sharing a vertex). These three areas receive weights of 1. On the other hand, \(A_1\) and \(A_3\) are not adjacent, and therefore receive a weight of zero. Again, weight \(w_{66}\) is set to zero.

The set of weights above defines the neighborhood of \(A_6\).
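The difference between the two criteria can be sketched in base R for a regular grid. Note that this is a hypothetical 3-by-3 lattice of square cells, not the zoning system of Figure 1; on such a lattice, two cells share an edge when they are one step apart horizontally or vertically, and share an edge or a vertex when they are at most one step apart in each direction:

```r
# Hypothetical 3-by-3 grid of cells, identified by row and column
cells <- expand.grid(row = 1:3, col = 1:3)

# Absolute differences in row and column between all pairs of cells
d_row <- abs(outer(cells$row, cells$row, "-"))
d_col <- abs(outer(cells$col, cells$col, "-"))

# Rook criterion: cells share an edge
W_rook <- (d_row + d_col == 1) * 1

# Queen criterion: cells share an edge or a vertex
W_queen <- (pmax(d_row, d_col) == 1) * 1

# Number of neighbors of each cell under each criterion
rowSums(W_rook)
rowSums(W_queen)
```

The center cell (the fifth cell in `cells`) has 4 neighbors under the rook criterion and 8 under the queen criterion, whereas a corner cell has 2 and 3 neighbors, respectively. In both cases the main diagonal is zero, since a cell is not its own neighbor.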

The spatial weights matrix for the whole system in Figure 1, using the queen criterion, is as follows: \[ \textbf{W}=\left (\begin{array}{c c c c c c} 0 & 1 & 1 & 1 & 0 & 0\\ 1 & 0 & 0 & 1 & 1 & 1\\ 1 & 0 & 0 & 1 & 0 & 0\\ 1 & 1 & 1 & 0 & 1 & 1\\ 0 & 1 & 0 & 1 & 0 & 1\\ 0 & 1 & 0 & 1 & 1 & 0\\ \end{array} \right). \]

Compare the matrix to the zoning system. The spatial weights matrix has the following properties:

  1. The main diagonal elements of the matrix are all zeros (no area is its own neighbor).

  2. Each zone has a row of weights in the matrix: row number one corresponds to \(A_1\), row number two corresponds to \(A_2\), and so on.

  3. Likewise, each zone has a column of weights.

  4. The sum of all values in a row gives the total number of neighbors for a zone. That is: \[ \text{The total number of neighbors of } A_i \text{ is given by: }\sum_{j=1}^{n}{w_{ij}} \]

The spatial weights matrix is often processed to obtain a row-standardized spatial weights matrix. This procedure consists of dividing every weight by the sum of its corresponding row (i.e., by the total number of neighbors of the zone), as follows: \[ w_{ij}^{st}=\frac{w_{ij}}{\sum_{j=1}^n{w_{ij}}} \]

The row-standardized weights matrix for the system in Figure 1 is: \[ \textbf{W}^{st}=\left (\begin{array}{c c c c c c} 0 & 1/3 & 1/3 & 1/3 & 0 & 0\\ 1/4 & 0 & 0 & 1/4 & 1/4 & 1/4\\ 1/2 & 0 & 0 & 1/2 & 0 & 0\\ 1/5 & 1/5 & 1/5 & 0 & 1/5 & 1/5\\ 0 & 1/3 & 0 & 1/3 & 0 & 1/3\\ 0 & 1/3 & 0 & 1/3 & 1/3 & 0\\ \end{array} \right). \]

The row-standardized spatial weights matrix has the following properties:

  1. Each weight now represents the proportion of a neighbor out of the total of neighbors. For instance, since the total of neighbors of \(A_1\) is 3, each neighbor contributes 1/3 to that total.

  2. The sum of all weights over a row equals 1, or 100% of all neighbors for that zone.
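The properties listed above can be checked with a few lines of base R. This is a minimal sketch that types in by hand the matrix \(\textbf{W}\) given above for the system in Figure 1; the object names (`W`, `n_neighbors`, `W_st`) are chosen here for illustration:

```r
# Spatial weights matrix for the six zones, entered row by row
W <- matrix(c(0, 1, 1, 1, 0, 0,
              1, 0, 0, 1, 1, 1,
              1, 0, 0, 1, 0, 0,
              1, 1, 1, 0, 1, 1,
              0, 1, 0, 1, 0, 1,
              0, 1, 0, 1, 1, 0),
            nrow = 6, byrow = TRUE,
            dimnames = list(paste0("A", 1:6), paste0("A", 1:6)))

# The sum of each row is the number of neighbors of the zone
n_neighbors <- rowSums(W)
n_neighbors

# Row-standardization: divide every weight by the sum of its row
# (the vector is recycled down the columns, so row i is divided by n_neighbors[i])
W_st <- W / n_neighbors

# Every row of the row-standardized matrix adds up to 1
rowSums(W_st)
```

The number of neighbors is 3, 4, 2, 5, 3, and 3 for zones \(A_1\) to \(A_6\), which are the denominators that appear in \(\textbf{W}^{st}\).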

21.6 Creating Spatial Weights Matrices in R

Coding spatial weights matrices by hand is a tedious and error-prone process. Fortunately, functions to generate them exist in R. The package spdep in particular has a number of useful utilities for working with spatial weights matrices.

The first step to create a spatial weights matrix is to find the neighbors (i.e., areas adjacent to) for each area. The function poly2nb is used for this. The input argument is a SpatialPolygonsDataFrame, a kind of object that spdep uses. Fortunately, it is straightforward to convert our sf object into a SpatialPolygonsDataFrame by means of the function as():

# Function `as()` is used to convert between object classes

Hamilton_CT.sp <- as(Hamilton_CT, "Spatial")

The following finds the neighbors (note that the default adjacency criterion is queen):

# The function `poly2nb()` takes an object of class "Spatial" with polygons, and finds the neighbors

Hamilton_CT.nb <- poly2nb(pl = Hamilton_CT.sp, queen = TRUE)

The value (output) of the function is an object of class nb:

class(Hamilton_CT.nb)
## [1] "nb"

The function summary() applied to an object of this class gives some useful information about the neighbors in the region, including the number of zones in this system (\(188\)), the total number of neighbor links (\(1,180\)), and the percentage of neighbor links out of all pairs of areas (3.34%; conversely, 96.66% of all possible zone pairs are not neighbors!). Other information includes the distribution of neighbors (3 zones have two neighbors, 8 zones have three neighbors, 22 zones have four neighbors, and so on):

summary(Hamilton_CT.nb)
## Neighbour list object:
## Number of regions: 188 
## Number of nonzero links: 1180 
## Percentage nonzero weights: 3.338615 
## Average number of links: 6.276596 
## Link number distribution:
## 
##  2  3  4  5  6  7  8  9 10 11 12 14 
##  3  8 22 32 35 45 30  6  1  1  4  1 
## 3 least connected regions:
## 174 175 188 with 2 links
## 1 most connected region:
## 33 with 14 links
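The figures in this summary are related by simple arithmetic, which can be reproduced as a check. The sketch below copies the link number distribution from the output above by hand; the variable names are only illustrative.

```r
# Values copied from the summary output above
n_regions   <- 188
link_counts <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14)   # links per zone
n_zones     <- c(3, 8, 22, 32, 35, 45, 30, 6, 1, 1, 4, 1)  # zones with that many links

# The distribution accounts for every zone in the system
sum(n_zones)                             # 188

# Total number of nonzero links
n_links <- sum(link_counts * n_zones)    # 1180

# Percentage of nonzero weights: links out of all n x n cells of W
pct_nonzero <- 100 * n_links / n_regions^2
pct_nonzero                              # 3.338615

# Average number of links per zone
avg_links <- n_links / n_regions
avg_links                                # 6.276596
```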

The nb object is a list that contains the neighbors for each zone. For instance, the neighbors of census tract 5370001.01 (the first tract in the dataframe) are the following tracts:

# Here, the indexing works by retrieving the first set of neighbors in `Hamilton_CT.nb` and then using those values to retrieve the census tract identifiers from our `Hamilton_CT` dataframe

Hamilton_CT$TRACT[Hamilton_CT.nb[[1]]]
## [1] "5370120.02" "5370122.01" "5370122.02" "5370124.00" "5370142.01" "5370133.01" "5370130.03"

The list of neighbors can be converted into a list of entries in a spatial weights matrix \(W\) by means of the function nb2listw (for “neighbors to matrix W in list form”):

Hamilton_CT.w <- nb2listw(Hamilton_CT.nb)
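To see what "matrix \(W\) in list form" means, it may help to mimic the idea with plain R lists. This is only a sketch under assumptions: the six-zone neighbor list is hypothetical, and the objects `nb_toy` and `w_toy` merely imitate the way neighbors and weights are kept as parallel lists; they are not spdep objects.

```r
# A toy neighbor list: one integer vector of neighbor indices per zone
# (hypothetical six-zone system, not the Hamilton data)
nb_toy <- list(c(2L, 3L, 4L),
               c(1L, 3L, 5L),
               c(1L, 2L, 4L, 5L),
               c(1L, 3L, 6L),
               c(2L, 3L, 6L),
               c(4L, 5L))

# Row-standardized weights in list form: every neighbor of zone i gets 1 / n_i
w_toy <- lapply(nb_toy, function(j) rep(1 / length(j), length(j)))

# Only the nonzero entries are stored, instead of all 6 x 6 cells
sum(lengths(nb_toy))

# Expanding the list form recovers the full row-standardized matrix
W_st <- matrix(0, nrow = length(nb_toy), ncol = length(nb_toy))
for (i in seq_along(nb_toy)) {
  W_st[i, nb_toy[[i]]] <- w_toy[[i]]
}
rowSums(W_st)
```

Storing only the nonzero entries matters for systems like Hamilton's, where the vast majority of zone pairs are not neighbors.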

We can visualize the neighboring (adjacent) areas:

plot(Hamilton_CT.sp, border = "gray")
plot(Hamilton_CT.nb, coordinates(Hamilton_CT.sp), col = "red", add = TRUE)

21.7 Spatial Moving Averages

The spatial weights matrix \(W\), and in particular its row-standardized version \(W^{st}\), is useful to calculate a spatial statistic, the spatial moving average.

The spatial moving average is a variation of the mean statistic: in fact, it is a weighted average, calculated using the spatial weights. Recall that the mean is calculated as the sum of all relevant values divided by the number of values summed. In the case of spatial data, the mean is what we would call a global statistic, since it is calculated using all data for a region: \[ \bar{x}=\frac{1}{n}\sum_{j=1}^{n}{x_j} \] where \(\bar{x}\) (read x-bar) is the mean of all values of x.

A spatial moving average is calculated in the same way, but for each area, and based only on the values of proximate areas: \[ \bar{x}_i=\frac{1}{n_i}\sum_{j\in N(i)}{x_j} \] where \(n_i\) is the number of neighbors of \(A_i\), and the sum is only for \(x_j\) that are in the neighborhood of \(i\) (\(j\in N(i)\) is read “j in the neighborhood of i”).

We can illustrate the way spatial moving averages work by making reference again to Figure 1.

Consider zone \(A_1\). The spatial weights matrix indicates that the neighborhood of \(A_1\) consists of three areas: \(A_2\), \(A_3\), and \(A_4\). Therefore \(n_1=3\), and \(j\in N(1)\) are 2, 3, and 4.

The spatial moving average of \(A_1\) for a variable \(x\) would then be calculated as: \[ \bar{x}_1=\frac{x_2 + x_3 + x_4}{3} \]

Notice that the spatial moving average can also be written in terms of the spatial weights, since membership in the neighborhood of \(i\) is implicit in the definition of \(w_{ij}\). Because \(w_{ij}\) takes values of zero and one, the effect is to turn on and off the values of \(x\) depending on whether they are for areas adjacent to \(i\): \[ \bar{x}_i=\frac{1}{n_i}\sum_{j=1}^n{w_{ij}x_j} \]

This means that the spatial moving average of \(A_1\) for a variable \(x\) on this system can also be calculated using the spatial weights matrix as: \[ \bar{x}_1=\frac{w_{11}x_1 + w_{12}x_2 + w_{13}x_3 + w_{14}x_4 + w_{15}x_5 + w_{16}x_6}{3} \]

Substituting the spatial weights: \[ \bar{x}_1=\frac{0x_1 + 1x_2 + 1x_3 + 1x_4 + 0x_5 + 0x_6}{3} = \frac{x_2 + x_3 + x_4}{3} \]

In other words, the spatial weights can be used directly in the calculation of spatial moving averages.

Further, notice that: \[ n_i=\sum_{j=1}^{n}w_{ij} \] which is simply the total number of neighbors of \(A_i\), and the value we used to row-standardize the spatial weights.

Since the row-standardized weights have already been divided by the number of neighbors, we can use them to express the spatial moving average as follows: \[ \bar{x}_i=\sum_{j=1}^{n}{w_{ij}^{st}x_j} \]

Continuing with this example, if we use the row-standardized weights, the spatial moving average at \(A_1\) is: \[ \bar{x}_1=0x_1 + \frac{1}{3}x_2 + \frac{1}{3}x_3 + \frac{1}{3}x_4 + 0x_5 + 0x_6 \] which is the same as: \[ \bar{x}_1=\frac{x_2 + x_3 + x_4}{3} \]
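The equivalence of these expressions can be verified with a few lines of base R. The sketch below assumes a hypothetical six-zone system in which \(A_1\) is adjacent to \(A_2\), \(A_3\), and \(A_4\) as in the example; the other adjacencies and the values of `x` are invented for illustration.

```r
# Hypothetical adjacencies; only the neighbors of zone 1 follow the text
edges <- rbind(c(1, 2), c(1, 3), c(1, 4), c(2, 3), c(2, 5),
               c(3, 4), c(3, 5), c(4, 6), c(5, 6))
W <- matrix(0, nrow = 6, ncol = 6)
W[edges] <- 1
W[edges[, 2:1]] <- 1
W_st <- W / rowSums(W)   # row-standardized weights

# Made-up values of a variable x for the six zones
x <- c(12, 30, 45, 18, 60, 24)

# (1) Average of the neighbors of zone 1, picked by hand
sma_1_direct <- mean(x[c(2, 3, 4)])

# (2) Binary weights switch the values on and off; divide by n_1
sma_1_binary <- sum(W[1, ] * x) / sum(W[1, ])

# (3) Row-standardized weights: one matrix product gives all zones at once
sma <- as.vector(W_st %*% x)

c(sma_1_direct, sma_1_binary, sma[1])   # three ways, same value
sma                                     # spatial moving average of every zone
```

The matrix product in the third form is the reason row-standardized weights are convenient: the spatial moving averages of all zones are obtained in a single operation.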

Consider the following map of Hamilton’s population density:

# You have seen previously how to create a choropleth map using quintiles. The first part of this is a choropleth map of population density

map <- ggplot(data = Hamilton_CT) +
  geom_sf(aes(fill = cut_number(POP_DENSITY, 5), 
                   POP_DENSITY = round(POP_DENSITY),
                   TRACT = TRACT), 
               color = "black") +
  # For the example, two census tracts will be identified more explicitly
  # Next, we use the function `filter()` to select census tract 5370142.02. We will color the boundaries of this census tract red 
  geom_sf(data = filter(Hamilton_CT, TRACT == "5370142.02"), 
               aes(POP_DENSITY = round(POP_DENSITY),
                   TRACT = TRACT), 
               color = "red",
               weight = 3, fill = NA) +
  # We use the function `filter()` again, now to select census tract 5370144.01. We will color the boundaries of this census tract green 
  geom_sf(data = filter(Hamilton_CT, TRACT == "5370144.01"), 
               aes(POP_DENSITY = round(POP_DENSITY),
                   TRACT = TRACT), 
               color = "green",
               weight = 3, fill = NA) +
  # This selects a palette for the fill colors and changes the label for the legend
  scale_fill_brewer(palette = "YlOrRd") +
  labs(fill = "Pop Density") +
  coord_sf()

# The function `ggplotly()` takes a `ggplot2` object and creates an interactive map
ggplotly(map, tooltip = c("TRACT", "POP_DENSITY"))

Manually calculate the spatial moving average for tract 5370142.02 (with the red boundary) and tract 5370144.01 (with the green boundary). Tip: hover over the tracts to see their population densities.

(32 + 109 + 48)/3
## [1] 63
(48 + 55 + 125)/3
## [1] 76

Spatial moving averages can be calculated in a straightforward way by means of the function lag.listw() of the spdep package. This function takes the spatial weights in list form and uses the weights stored there; since nb2listw() row-standardizes the weights by default (style = "W"), the result here is the spatial moving average.

Here, we calculate the spatial moving average of population density:

POP_DENSITY.sma <- lag.listw(x = Hamilton_CT.w, Hamilton_CT$POP_DENSITY)

And now we can plot the spatial moving average of population density. First we join this variable to our sf dataframe with the census tracts. The key for joining the two dataframes is the unique tract identifier:

Hamilton_CT <- left_join(Hamilton_CT, data.frame(TRACT = Hamilton_CT$TRACT, POP_DENSITY.sma), by = "TRACT")

And plot:

# In this chunk of code we create a choropleth map, but now of the spatial moving average of population density

# First map the spatial moving average of population density using quintiles
map.sma <- ggplot() +
  geom_sf(data = Hamilton_CT,
          aes(fill = cut_number(POP_DENSITY.sma, 5),
              POP_DENSITY.sma = round(POP_DENSITY.sma),
              TRACT = TRACT),
          color = "black") +
  # Select and plot census tract 5370142.02 and color its boundaries in red
  geom_sf(data = filter(Hamilton_CT, TRACT == "5370142.02"), 
          aes(POP_DENSITY.sma = round(POP_DENSITY.sma),
              TRACT = TRACT), 
          color = "red",
          weight = 3, fill = NA) +
  # Select and plot census tract 5370144.01 and color its boundaries in green
  geom_sf(data = filter(Hamilton_CT, TRACT == "5370144.01"), 
          aes(POP_DENSITY.sma = round(POP_DENSITY.sma),
              TRACT = TRACT), 
          color = "green",
          weight = 3, fill = NA) +
  # Embellish the map with a color palette to your taste and labels
  scale_fill_brewer(palette = "YlOrRd") +
  labs(fill = "Pop Density SMA") +
  coord_sf()

# Again, `ggplotly()` takes the `ggplot2` object and creates an interactive map
ggplotly(map.sma, tooltip = c("TRACT", "POP_DENSITY.sma"))

Verify that your manual calculations for the two tracts above are correct. What differences do you notice between the map of population density and the map of spatial moving averages of population density?

21.8 Other Criteria for Coding Proximity

Adjacency is not the only criterion that can be used for coding proximity.

Occasionally, the distance between areas is calculated by using the centroids of the areas as their representative points. A centroid is the geometric center of an area (loosely, the mean position of all the points in it), and in this way represents the “center of gravity” of the area.

The inter-centroid distance allows us to define additional criteria for proximity, including neighbors within a certain distance threshold, and \(k\)-nearest neighbors.

  • Distance-based criterion

\[ w_{ij}=\bigg\{\begin{array}{l l} 1\text{ if inter-centroid distance } d_{ij}\leq \delta\\ 0\text{ otherwise}\\ \end{array} \] where \(\delta\) is a distance threshold.

Distance-based nearest neighbors can be obtained in R by means of the function dnearneigh().

To implement this criterion we need to find the centroids of the polygons with st_centroid() and then extract the coordinates of the centroids with st_coordinates():

CT_centroids <- st_centroid(Hamilton_CT) %>% 
  st_coordinates()
## Warning in st_centroid.sf(Hamilton_CT): st_centroid assumes attributes are constant over geometries of x

We can create a nearest neighbors object nb using two threshold distances, a minimum and a maximum distance value. In this example we will consider that the neighbors of zone \(A_i\) are all zones \(A_j\) whose centroids are between \(0\) and \(5,000\) meters from the centroid of \(A_i\):

Hamilton_CT.dnb <- dnearneigh(CT_centroids, d1 = 0, d2 = 5000)

We can visualize the neighboring (“adjacent”) areas:

plot(Hamilton_CT.sp, border = "gray")
plot(Hamilton_CT.dnb, CT_centroids, col = "red", add = TRUE)

Try changing the distance threshold to see how different neighborhoods are defined.
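The logic of the distance-based criterion can also be sketched in base R, without spdep. The example below assumes six made-up centroid coordinates (in kilometers) and a threshold `delta` of 3 km; it illustrates the definition of \(w_{ij}\) above and is not meant to reproduce the internals of dnearneigh().

```r
# Hypothetical centroid coordinates (km) for six zones
coords <- cbind(x = c(0, 1, 0, 5, 6, 10),
                y = c(0, 0, 2, 5, 5, 9))

# Matrix of inter-centroid distances d_ij
D <- as.matrix(dist(coords))

# w_ij = 1 if d_ij <= delta, 0 otherwise; a zone is not its own neighbor
delta <- 3
W_dist <- (D <= delta) * 1
diag(W_dist) <- 0

# Number of neighbors of each zone under this threshold
rowSums(W_dist)
```

In this toy system the sixth zone is farther than 3 km from every other centroid, so it ends up with no neighbors at all. Increasing `delta` connects it, at the price of giving the clustered zones more neighbors.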

  • \(k\)-nearest neighbors

A potential disadvantage of using a distance-based criterion is that for zoning systems with areas of vastly different sizes, small areas will end up having many neighbors, whereas large areas will have few or none.

The criterion of \(k\)-nearest neighbors allows for some adaptation to the size of the areas. Under this criterion, all zones have the exact same number of neighbors, but the geographical extent of the neighborhood may (and likely will) change. The criterion is defined as follows: \[ w_{ij}=\bigg\{\begin{array}{l l} 1\text{ if } A_j \text{ is one of } k\text{-nearest neighbors of } A_i\\ 0\text{ otherwise}\\ \end{array} \]

In R, \(k\)-nearest neighbors can be obtained by means of the function knearneigh(), whose arguments include the value of \(k\). The function knn2nb() then converts the result into an object of class nb:

Hamilton_CT.knb <- knn2nb(knearneigh(CT_centroids, k = 3))

We can again visualize the neighboring (“adjacent”) areas:

plot(Hamilton_CT.sp, border = "gray")
plot(Hamilton_CT.knb, CT_centroids, col = "red", add = TRUE)

Try changing the value of k to see how the neighborhoods change.
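A base R sketch of the \(k\)-nearest neighbors criterion makes one of its properties easy to see: every zone has exactly \(k\) neighbors, but the relation need not be symmetric. The coordinates below are made up, and the loop is only an illustration of the definition, not the algorithm used by knearneigh().

```r
# Hypothetical centroid coordinates (km) for six zones
coords <- cbind(x = c(0, 1, 0, 5, 6, 10),
                y = c(0, 0, 2, 5, 5, 9))
D <- as.matrix(dist(coords))

k <- 2
n <- nrow(D)
W_knn <- matrix(0, nrow = n, ncol = n)

for (i in seq_len(n)) {
  d_i <- D[i, ]
  d_i[i] <- Inf                       # a zone cannot be its own neighbor
  nearest <- order(d_i)[seq_len(k)]   # indices of the k closest centroids
  W_knn[i, nearest] <- 1
}

rowSums(W_knn)     # every zone has exactly k neighbors
W_knn[4, 3]        # zone 3 is among the 2 nearest neighbors of zone 4 ...
W_knn[3, 4]        # ... but zone 4 is not among the 2 nearest of zone 3
```

This asymmetry is a consequence of the criterion itself: an isolated zone must reach far to find its \(k\) neighbors, while zones in a tight cluster find theirs close by.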

This chapter has equipped you to define various forms of proximity for area data. You have also seen how spatial moving averages can be calculated using row-standardized spatial weights matrices.

22 Activity 10: Area Data II

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

22.1 Practice questions

Answer the following questions:

  1. List and describe two criteria to define proximity in area data analysis.
  2. What is a spatial weights matrix?
  3. Why do spatial weight matrices have zeros in the main diagonal?
  4. How is a spatial weights matrix row-standardized?
  5. Write the spatial weights matrices for the sample systems in the two figures below (Sample areal system 1 and Sample areal system 2). Explain the criteria used to do so.

Figure: Sample areal system 1

Figure: Sample areal system 2

22.2 Learning objectives

In this activity, you will:

  1. Create spatial weights matrices.
  2. Calculate the spatial moving average of a variable.
  3. Create scatterplots of a variable and its spatial moving average.
  4. Think about ways to decide whether a landscape is random when working with area data.

22.3 Suggested reading

O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 7. John Wiley & Sons: New Jersey.

22.4 Preliminaries

It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently on the workspace.

Load the libraries you will use in this activity.

In addition to tidyverse, you will need sf, a package that implements simple features in R (you can learn about sf here) and spdep, a package that implements several spatial statistical methods (you can learn more about it here):

library(isdas)
library(plotly)
library(sf)
library(spdep)
library(tidyverse)

In the practice that preceded this activity, you learned about area data and visualization techniques for area data.

Begin by loading the data that you will use in this activity:

data(Hamilton_CT)

This is a sf object with census tracts and selected demographic variables for the Hamilton CMA in Canada.

You can obtain new (calculated) variables as follows. For instance, to obtain the proportion of residents who are between 20 and 34 years old, and between 35 and 49:

Hamilton_CT <- Hamilton_CT %>%
  mutate(Prop20to34 = (AGE_20_TO_24 + AGE_25_TO_29 + AGE_30_TO_34)/POPULATION,
         Prop35to49 = (AGE_35_TO_39 + AGE_40_TO_44 + AGE_45_TO_49)/POPULATION)

You can also convert the sf object into a SpatialPolygonsDataFrame object for use with the spdep package:

Hamilton_CT.sp <- as(Hamilton_CT, "Spatial")

You are now ready for the next activity.

22.5 Activity

NOTE: Activities include technical “how to” tasks/questions. Usually, these ask you to practice using the software to organize data, create plots, and so on in support of analysis and interpretation. The second type of questions ask you to activate your brainware and to think geographically and statistically.

Activity Part I

  1. Create a spatial weights matrix for the census tracts in the Hamilton CMA. Use adjacency as your criterion for proximity.

  2. Calculate the spatial moving average for the following two variables: 1) proportion of the population who are 20 to 34 years old; and 2) proportion of the population who are 65 and older.

  3. Append the spatial moving averages to your dataframe.

  4. Choose one age group and create a scatterplot of the proportion of population in that group versus its spatial moving average. (Hint: if you create the scatterplot using ggplot2 you can add the 45 degree line by means of geom_abline(slope = 1, intercept = 0)).

Activity Part II

  1. Show your scatterplot of the population versus its spatial moving average to a fellow student. What is the meaning of the 45 degree line in this plot?

  2. Create a null landscape by scrambling the values of your variable. For instance, you can use the variable Prop20to34 to generate a null landscape as follows:

Hamilton_CT$Null_1 <- sample(Hamilton_CT$Prop20to34)

  3. Calculate the spatial moving average of your null landscape, and create a scatterplot just like you did for your variable. How is this scatterplot different?

23 Area Data III

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

23.1 Learning Objectives

In the previous chapter and its corresponding activity, you learned about different ways to define proximity for area data, about spatial weights matrices, and how spatial weights matrices could be used to calculate spatial moving averages.

In this practice, you will learn about:

  1. Spatial moving averages and simulation.
  2. The concept of spatial autocorrelation.
  3. Moran’s \(I\) coefficient and Moran’s scatterplot.
  4. Hypothesis testing for spatial autocorrelation.

23.2 Suggested Readings

  • Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapter 7. Longman: Essex.
  • Bivand RS, Pebesma E, and Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 9. Springer: New York.
  • Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 7. Sage: Los Angeles.
  • O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 7. John Wiley & Sons: New Jersey.

23.3 Preliminaries

As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently on the workspace.

Load the libraries you will use in this activity:

library(isdas)
library(gridExtra)
library(patchwork)
library(spdep)
library(sf)
library(tidyverse)

Read the data used in this chapter. This is an object of class sf (simple feature) with the census tracts of Hamilton CMA and some selected population variables from the 2011 Census of Canada:

data(Hamilton_CT)

You can quickly verify the contents of the dataframe by means of summary:

summary(Hamilton_CT)
##        ID               AREA             TRACT             POPULATION     POP_DENSITY         AGE_LESS_20    
##  Min.   : 919807   Min.   :  0.3154   Length:188         Min.   :    5   Min.   :    2.591   Min.   :   0.0  
##  1st Qu.: 927964   1st Qu.:  0.8552   Class :character   1st Qu.: 2639   1st Qu.: 1438.007   1st Qu.: 528.8  
##  Median : 948130   Median :  1.4157   Mode  :character   Median : 3595   Median : 2689.737   Median : 750.0  
##  Mean   : 948710   Mean   :  7.4578                      Mean   : 3835   Mean   : 2853.078   Mean   : 899.3  
##  3rd Qu.: 959722   3rd Qu.:  2.7775                      3rd Qu.: 4692   3rd Qu.: 3783.889   3rd Qu.:1110.0  
##  Max.   :1115750   Max.   :138.4466                      Max.   :11675   Max.   :14234.286   Max.   :3285.0  
##   AGE_20_TO_24    AGE_25_TO_29    AGE_30_TO_34     AGE_35_TO_39     AGE_40_TO_44     AGE_45_TO_49    AGE_50_TO_54  
##  Min.   :  0.0   Min.   :  0.0   Min.   :   0.0   Min.   :   0.0   Min.   :   0.0   Min.   :  0.0   Min.   :  0.0  
##  1st Qu.:168.8   1st Qu.:135.0   1st Qu.: 135.0   1st Qu.: 145.0   1st Qu.: 170.0   1st Qu.:203.8   1st Qu.:203.8  
##  Median :225.0   Median :215.0   Median : 195.0   Median : 200.0   Median : 230.0   Median :282.5   Median :280.0  
##  Mean   :253.9   Mean   :232.8   Mean   : 228.2   Mean   : 239.6   Mean   : 268.7   Mean   :310.6   Mean   :300.3  
##  3rd Qu.:311.2   3rd Qu.:296.2   3rd Qu.: 281.2   3rd Qu.: 280.0   3rd Qu.: 325.0   3rd Qu.:385.0   3rd Qu.:375.0  
##  Max.   :835.0   Max.   :915.0   Max.   :1320.0   Max.   :1200.0   Max.   :1105.0   Max.   :880.0   Max.   :740.0  
##   AGE_55_TO_59    AGE_60_TO_64  AGE_65_TO_69    AGE_70_TO_74    AGE_75_TO_79     AGE_80_TO_84     AGE_MORE_85    
##  Min.   :  0.0   Min.   :  0   Min.   :  0.0   Min.   :  0.0   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:175.0   1st Qu.:140   1st Qu.:115.0   1st Qu.: 90.0   1st Qu.: 68.75   1st Qu.: 50.00   1st Qu.: 35.00  
##  Median :240.0   Median :220   Median :157.5   Median :130.0   Median :100.00   Median : 77.50   Median : 70.00  
##  Mean   :257.7   Mean   :229   Mean   :174.2   Mean   :139.7   Mean   :118.32   Mean   : 95.05   Mean   : 87.71  
##  3rd Qu.:325.0   3rd Qu.:295   3rd Qu.:221.2   3rd Qu.:180.0   3rd Qu.:160.00   3rd Qu.:120.00   3rd Qu.:105.00  
##  Max.   :625.0   Max.   :540   Max.   :625.0   Max.   :540.0   Max.   :575.00   Max.   :420.00   Max.   :400.00  
##           geometry  
##  POLYGON      :188  
##  epsg:26917   :  0  
##  +proj=utm ...:  0  
##                     
##                     
## 

23.4 Spatial Moving Averages and Simulation

In the preceding chapter and activity you learned about different criteria to define proximity for the analysis of area data, and how spatial weights matrices can be used to code patterns of proximity among zones in a spatial system. Furthermore, you also saw how spatial weights matrices can be used to calculate spatial moving averages, which in turn can be used to explore spatial patterns in area data.

We will begin this chapter by briefly revisiting some of these notions. In the following chunk, we create a spatial weights matrix for Hamilton CMA census tracts based on the adjacency criterion:

# Function `poly2nb()` builds a list of neighbors based on contiguous boundaries. The argument for this function is an object of class "sf", which contains multi-polygon objects. 

# Function `nb2listw()` takes a list of neighbors and creates a matrix of spatial weights in the form of a list. Together, these two functions create a spatial weights matrix for the Census Tracts in Hamilton.

Hamilton_CT.nb <- poly2nb(pl = Hamilton_CT)
Hamilton_CT.w <- nb2listw(Hamilton_CT.nb)

Once you have a matrix of spatial weights, it can be used to calculate the spatial moving average. In this example, we calculate the spatial moving average of the variable for population density, i.e., POP_DENSITY, which is found in the sf dataframe:

# The function `lag.listw()` takes as argument the population density by census tracts in Hamilton, and calculates the moving average, with the "moving" part given by the local neighborhoods around each zone as defined by `Hamilton_CT.w`

POP_DENSITY.sma <- lag.listw(Hamilton_CT.w, Hamilton_CT$POP_DENSITY)

After calculating the spatial moving average of population density, we can join this new variable to the sf object:

Hamilton_CT$POP_DENSITY.sma <- POP_DENSITY.sma

As you saw in your last activity, the spatial moving average can be used in two ways to explore the spatial pattern of an area variable: as a smoother and by means of a scatterplot, combined with the original variable.

23.5 The Spatial Moving Average as a Smoother

The spatial moving average, when mapped, is essentially a smoothing technique. What do we mean by smoothing? By reporting the average of the neighbors instead of the actually observed value of the variable, we reduce the amount of variability that is communicated. This often can make it easier to distinguish the overall pattern, at the cost of some information loss (think of how when mapping quadrats we lost some information/detail by calculating the intensity for areas).

We can illustrate the use of the spatial moving average as a smoother with the help of a little simulation.

To simulate a random spatial variable, we can randomize the observations that we already have, reassigning them at random to areas in the system. This is accomplished as follows:

# By sampling at random and without replacement from the original variable, we create a null landscape. We will call this `POP_DENSITY_s1`, where the "s1" part is to indicate that this is our first simulated random landscape. We will actually repeat this process below.

POP_DENSITY_s1 <- sample(Hamilton_CT$POP_DENSITY)

Calculate the spatial moving average for this randomized variable (i.e., null landscape):

# We use the function `lag.listw()` to calculate the spatial moving average, but now for the null landscape we just simulated. 

POP_DENSITY_s1.sma <- lag.listw(Hamilton_CT.w, POP_DENSITY_s1)

Once you have seen how to randomize the variable, repeat the process to simulate a total of eight new variables/null landscapes, and calculate their spatial moving averages:

# Note that we are creating 8 null landscapes based on our original population density variable, and that we are calculating the spatial moving average for each of them. Each simulation has a new name: s2, s3, s4,..., s8. 

# Null landscape/simulation #2
POP_DENSITY_s2 <- sample(Hamilton_CT$POP_DENSITY)
POP_DENSITY_s2.sma <- lag.listw(Hamilton_CT.w, POP_DENSITY_s2)

# Null landscape/simulation #3
POP_DENSITY_s3 <- sample(Hamilton_CT$POP_DENSITY)
POP_DENSITY_s3.sma <- lag.listw(Hamilton_CT.w, POP_DENSITY_s3)

# Null landscape/simulation #4
POP_DENSITY_s4 <- sample(Hamilton_CT$POP_DENSITY)
POP_DENSITY_s4.sma <- lag.listw(Hamilton_CT.w, POP_DENSITY_s4)

# Null landscape/simulation #5
POP_DENSITY_s5 <- sample(Hamilton_CT$POP_DENSITY)
POP_DENSITY_s5.sma <- lag.listw(Hamilton_CT.w, POP_DENSITY_s5)

# Null landscape/simulation #6
POP_DENSITY_s6 <- sample(Hamilton_CT$POP_DENSITY)
POP_DENSITY_s6.sma <- lag.listw(Hamilton_CT.w, POP_DENSITY_s6)

# Null landscape/simulation #7
POP_DENSITY_s7 <- sample(Hamilton_CT$POP_DENSITY)
POP_DENSITY_s7.sma <- lag.listw(Hamilton_CT.w, POP_DENSITY_s7)

# Null landscape/simulation #8
POP_DENSITY_s8 <- sample(Hamilton_CT$POP_DENSITY)
POP_DENSITY_s8.sma <- lag.listw(Hamilton_CT.w, POP_DENSITY_s8)

Next, we will add all the null landscapes that you just simulated to the dataframes, as well as their spatial moving averages. This is useful for mapping and plotting purposes:

# Here we add the simulated landscapes to the `sf` dataframe.
Hamilton_CT$POP_DENSITY_s1 <- POP_DENSITY_s1
Hamilton_CT$POP_DENSITY_s2 <- POP_DENSITY_s2
Hamilton_CT$POP_DENSITY_s3 <- POP_DENSITY_s3
Hamilton_CT$POP_DENSITY_s4 <- POP_DENSITY_s4
Hamilton_CT$POP_DENSITY_s5 <- POP_DENSITY_s5
Hamilton_CT$POP_DENSITY_s6 <- POP_DENSITY_s6
Hamilton_CT$POP_DENSITY_s7 <- POP_DENSITY_s7
Hamilton_CT$POP_DENSITY_s8 <- POP_DENSITY_s8

# Here we add the spatial moving averages of the simulated landscapes to the `sf` dataframe.
Hamilton_CT$POP_DENSITY_s1.sma <- POP_DENSITY_s1.sma
Hamilton_CT$POP_DENSITY_s2.sma <- POP_DENSITY_s2.sma
Hamilton_CT$POP_DENSITY_s3.sma <- POP_DENSITY_s3.sma
Hamilton_CT$POP_DENSITY_s4.sma <- POP_DENSITY_s4.sma
Hamilton_CT$POP_DENSITY_s5.sma <- POP_DENSITY_s5.sma
Hamilton_CT$POP_DENSITY_s6.sma <- POP_DENSITY_s6.sma
Hamilton_CT$POP_DENSITY_s7.sma <- POP_DENSITY_s7.sma
Hamilton_CT$POP_DENSITY_s8.sma <- POP_DENSITY_s8.sma
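Writing out each simulation by hand makes the steps explicit, but the same idea can be expressed more compactly. The following is a sketch in base R on a hypothetical six-zone system (the weights `W_st` and the values `x` are invented); it uses replicate() to draw the eight null landscapes at once and a matrix product to obtain all their spatial moving averages.

```r
# Hypothetical row-standardized weights for six zones
edges <- rbind(c(1, 2), c(1, 3), c(1, 4), c(2, 3), c(2, 5),
               c(3, 4), c(3, 5), c(4, 6), c(5, 6))
W <- matrix(0, nrow = 6, ncol = 6)
W[edges] <- 1
W[edges[, 2:1]] <- 1
W_st <- W / rowSums(W)

# Made-up values of the variable of interest
x <- c(12, 30, 45, 18, 60, 24)

# Fix the seed so that the random draws can be reproduced
set.seed(2021)

# Each column is one null landscape: the same values, reassigned at random
sims <- replicate(8, sample(x))
colnames(sims) <- paste0("s", 1:8)

# Spatial moving averages of all eight null landscapes in one step
sims_sma <- W_st %*% sims
colnames(sims_sma) <- paste0("s", 1:8, ".sma")

dim(sims)       # 6 zones by 8 simulations
dim(sims_sma)
```

With the Hamilton data, the same pattern would presumably apply by wrapping `lag.listw(Hamilton_CT.w, sample(Hamilton_CT$POP_DENSITY))` in replicate(); the explicit version in the text has the advantage of showing each step.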

It would be useful to compare the original landscape of population density to the null landscapes that you created before. To create a single figure with choropleth maps of the empirical variable and the eight simulated variables using the facet_wrap() function of ggplot2, we must first reorganize the data so that all the population density variables are in a single column, and all spatial moving average variables are also in a single column. Further, we need a new column that identifies which variable the values in this column correspond to. We will solve this little data management problem by copying only the data we are interested in into a new dataframe (by means of select()), and then gathering the spatial moving averages into a single column:

# `Hamilton_CT2` is a new dataframe. Here, the pipe operators (%>%) are used to pass the original dataframe to the `select()` function, and then the output of that is passed on to the `gather()` function. Notice that we are selecting the empirical spatial moving average and the spatial moving averages of the 8 simulated instances of population density. 

Hamilton_CT2 <- Hamilton_CT %>% # This pipe operator passes the dataframe to `select()`
  # `select()` keeps only the spatial moving averages and geometry
  select(POP_DENSITY.sma, 
         POP_DENSITY_s1.sma,
         POP_DENSITY_s2.sma,
         POP_DENSITY_s3.sma,
         POP_DENSITY_s4.sma,
         POP_DENSITY_s5.sma,
         POP_DENSITY_s6.sma,
         POP_DENSITY_s7.sma,
         POP_DENSITY_s8.sma,
         geometry) %>% # This pipe operator passes the dataframe with only the spatial moving average variables and the geometry to `gather()`
  # `gather()` places all variables with the exception of `geometry` in a single column named `DENSITY_SMA` and creates a new variable called `VAR` with the names of the original columns (i.e., POP_DENSITY.sma, POP_DENSITY_s1.sma, etc.)
  gather(VAR, DENSITY_SMA, -geometry)

Now the new dataframe with all spatial moving averages in a single column can be used to create choropleth maps. The function facet_wrap() is used to create facet plots so that we can place all maps in a single figure:

ggplot() + 
  geom_sf(data = Hamilton_CT2, 
          aes(fill = DENSITY_SMA), color = NA) + 
  facet_wrap(~VAR, ncol = 3) + # We are creating multiple plots for single data frame by means of the "facet_wrap" function.
  scale_fill_distiller(palette = "YlOrRd", direction = 1) + # Select palette for colors 
  labs(fill = "Pop Den SMA") + # Change the label of the legend
  theme(axis.text.x = element_blank(), 
        axis.text.y = element_blank()) # Remove the axis labels to avoid cluttering the plots

The empirical variable is the map in the upper left corner (labeled POP_DENSITY.sma). The remaining 8 maps are simulated variables. Would you say the map of the empirical variable is fairly different from the map of the simulated variables? What are the key differences?

An additional advantage of the spatial moving average is its use in the development of scatterplots. The information below provides further examples of exploring spatial moving averages with scatterplots.

23.6 Spatial Moving Average Scatterplots

Let us explore the use of spatial moving average scatterplots. First, we will extract the density information from the original sf object, reorganize, and bind to Hamilton_CT2 so that we can plot using faceting:

Hamilton_CT2 <- Hamilton_CT2 %>% # Pass `Hamilton_CT2` as the first argument of `data.frame()`
  data.frame(Hamilton_CT %>% # Pass `Hamilton_CT` to `st_drop_geometry()`
               st_drop_geometry() %>% # Drop the geometry because it is already available in `Hamilton_CT2`.
               # Select from `Hamilton_CT` the original population density and the 8 null landscapes simulated from it.
               select(POP_DENSITY,
                      POP_DENSITY_s1,
                      POP_DENSITY_s2,
                      POP_DENSITY_s3,
                      POP_DENSITY_s4,
                      POP_DENSITY_s5,
                      POP_DENSITY_s6,
                      POP_DENSITY_s7,
                      POP_DENSITY_s8) %>% # Pass the result to `gather()`  
               gather(VAR, DENSITY) %>% # Copy all density variables to a single column, and create a new variable called `VAR` with the names of the original columns (i.e., POP_DENSITY, POP_DENSITY_s1, etc.) 
               select(DENSITY)) # Drop VAR from the dataframe

After reorganizing the data we can create the scatterplot of the empirical population density and its spatial moving average, as well as the scatterplots of the simulated variables and their spatial moving averages for comparison (the plots include the 45 degree line). Again, the use of facet_wrap() allows us to put all plots in a single figure:

# Plot the points and add the 45 degree line (slope = 1, intercept = 0)
ggplot(data = Hamilton_CT2, aes(x = DENSITY, y = DENSITY_SMA, color = VAR)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  coord_equal() +
  facet_wrap(~ VAR, ncol = 3)

What difference do you see between the empirical and simulated variables in these scatterplots?

It is possible to fit a line to the scatterplots (i.e., adding a regression line). This makes it easier to appreciate the difference between the empirical and simulated variables. This line would take the following form, with \(\beta\) as the slope of the line, and \(\alpha\) the intercept: \[ \bar{x}_i =\alpha + \beta x_i \]

Recreate the previous figure, but now add fitted lines to the scatterplots by means of the function geom_smooth(). The method “lm” means linear model, so the fitted line is a straight line:

ggplot(data = Hamilton_CT2, aes(x = DENSITY, y = DENSITY_SMA, color = VAR)) +
  geom_point(alpha = 0.1) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  # Add a fitted line to the plots
  geom_smooth(method = "lm") +
  coord_equal() +
  facet_wrap(~ VAR, ncol = 3) 
## `geom_smooth()` using formula 'y ~ x'

You will notice that the fitted line tends to be flat for the simulated variables; this is to be expected, since these variables are spatially random: the values of the variable at \(i\) are independent of their local means. In other words, the probability that the map is random is high (in fact, since these 8 maps are null landscapes, we know for a fact that they are random).

The empirical variable, on the other hand, has a slope that is much closer to the 45 degree line. This indicates that the values of the variable at \(i\) are not independent of their local means: in other words, \(x_i\) is correlated with \(\overline{x_i}\), and the probability of a non-random pattern is high. This phenomenon is called spatial autocorrelation, and it is a fundamental way to describe spatial data. We will discuss this more extensively next.
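The contrast between flat and steep fitted lines can be reproduced with a small simulation. The following is only a sketch under stated assumptions: instead of the Hamilton census tracts it uses a toy ring of 200 zones where each zone has two neighbors with weight 1/2, and the helper `ring_sma()` and both simulated variables are hypothetical names introduced for illustration.

```r
set.seed(123)                        # Arbitrary seed, for reproducibility
n <- 200                             # Number of zones on the toy ring

# Spatial moving average on a ring: each zone has two neighbors with weight 1/2
ring_sma <- function(x) {
  k <- length(x)
  (x[c(k, 1:(k - 1))] + x[c(2:k, 1)]) / 2
}

# A null landscape: values are assigned without any spatial structure
x_random <- rnorm(n)
# A patterned landscape: a smooth trend around the ring plus some noise
x_pattern <- sin(2 * pi * (1:n) / n) + rnorm(n, sd = 0.2)

# Slopes of the fitted lines (spatial moving average regressed on the variable)
slope_random <- unname(coef(lm(ring_sma(x_random) ~ x_random))[2])
slope_pattern <- unname(coef(lm(ring_sma(x_pattern) ~ x_pattern))[2])

c(random = slope_random, pattern = slope_pattern)
```

The slope for the random variable should be close to zero, whereas the slope for the patterned variable should be much closer to one, which mirrors what the faceted scatterplots show.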

23.7 Spatial Autocorrelation and Moran’s \(I\) coefficient

As seen above, the spatial moving average can provide evidence of the phenomenon of spatial autocorrelation, that is, when a variable displays spatial patterns whereby the values of a variable at zone \(i\) are not independent of the values of the variable in the neighborhood of zone \(i\).

A convenient modification to the concept of the spatial moving average is as follows. Instead of using the variable \(x\) for the calculation of the spatial moving average, we first center it on the global mean: \[ z_i = x_i - \bar{x} \]

In this way, the values of \(z_i\) are given in deviations from the mean. By centering the variable on its mean, the fitted line can be made to pass through the origin.

Calculate the mean-centered version of POP_DENSITY, and then its spatial moving average:

df_mean_center_scatterplot <- transmute(Hamilton_CT, # Modify values in dataframe
                                        Density_z = POP_DENSITY - mean(POP_DENSITY), # Subtract the mean, so that the variable now is deviations from the mean 
                                        SMA_z = lag.listw(Hamilton_CT.w, Density_z)) # Calculate the spatial moving average of the newly created variable `Density_z`
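Why does centering only displace the origin? With row-standardized weights the spatial moving average is a weighted mean, so subtracting a constant from the variable subtracts the same constant from its spatial moving average. The sketch below checks this on a toy ring of six zones (an assumption made for illustration; the values and the helper `ring_sma()` are hypothetical and not part of the Hamilton data).

```r
# Hypothetical values for six zones arranged on a ring
x <- c(10, 20, 30, 40, 50, 60)

# Spatial moving average with row-standardized weights (two neighbors, weight 1/2 each)
ring_sma <- function(v) {
  k <- length(v)
  (v[c(k, 1:(k - 1))] + v[c(2:k, 1)]) / 2
}

z <- x - mean(x)        # Mean-centered variable
sma_x <- ring_sma(x)    # Spatial moving average of the original variable
sma_z <- ring_sma(z)    # Spatial moving average of the centered variable

# The two moving averages differ only by the global mean of x
all.equal(sma_z, sma_x - mean(x))
```

Since both axes are shifted by the same constant, the cloud of points keeps its shape and only the origin moves.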

Compare the following two plots. You will see that they are identical, but in the mean-centered one the origin of the axes coincides with the means of \(x\) and the spatial moving average of \(x\). In other words, we have the same data, but we have displaced the origin of the plot:

# Create a scatterplot of population density and its spatial moving average
sc1 <- ggplot(data = filter(Hamilton_CT2, VAR == "POP_DENSITY.sma"),
              aes(x = DENSITY, y = DENSITY_SMA)) +
  geom_point(alpha = 0.1) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  geom_smooth(method = "lm") +
  ggtitle("Population Density") +
  coord_equal()

# Create a scatterplot of the mean-centered population density, and its spatial moving average
sc2 <- ggplot(data = df_mean_center_scatterplot, 
              aes(x = Density_z, y = SMA_z)) +
  geom_point(alpha = 0.1) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  geom_smooth(method = "lm", formula = y ~ x-1) +
  ggtitle("Mean-Centered Population Density") +
  coord_equal()

# Use patchwork to place the two plots in a single figure 
sc1 + sc2
## `geom_smooth()` using formula 'y ~ x'

How is it useful to displace the origin of the axes to the mean values of \(x\) and its spatial moving average? To explain this, notice that the values on the first scatterplot are all positive. The values on the second scatterplot are positive or negative, depending on whether they are above or below the mean. This sign is interesting. Notice what happens when the variable \(z_i\) is multiplied by its spatial moving average: \[ z_i\bar{z}_i = z_i\sum_{j=1}^n{w_{ij}^{st}z_j} \]

When \(z_i\) is above its mean, it is a positive value. When it is below the mean, it is a negative value. Likewise, when \(\bar{z}_i\) is above its mean, it is a positive value, and negative otherwise. The mean is a useful benchmark to see if values are relatively high, or relatively low.

There are four possibilities with respect to the combinations of (relatively) high and low values.

  1. Quadrant 1 (the value of \(z_i\) is high & the value of \(\bar{z}_i\) is also high):

If \(z_i\) is above the mean, it is a relatively high value in the distribution (signed positive). If its neighbors are also relatively high values, the spatial moving average will be above the mean, and also signed positive. Their product will be positive (positive times positive equals positive).

  2. Quadrant 2 (the value of \(z_i\) is low & the value of \(\bar{z}_i\) is high):

If \(z_i\) is below the mean, it is a relatively low value in the distribution (signed negative). If its neighbors in contrast are relatively high values, the spatial moving average will be above the mean, and signed positive. Their product will be negative (negative times positive equals negative).

  3. Quadrant 3 (the value of \(z_i\) is low & the value of \(\bar{z}_i\) is also low):

If \(z_i\) is below the mean, it is a relatively low value in the distribution (signed negative). If its neighbors are also relatively low values, the spatial moving average will be below the mean, and also signed negative. Their product will be positive (negative times negative equals positive).

  4. Quadrant 4 (the value of \(z_i\) is high & the value of \(\bar{z}_i\) is low):

If \(z_i\) is above the mean, it is a relatively high value in the distribution (signed positive). If its neighbors are relatively low values, the spatial moving average will be below the mean, and signed negative. Their product will be negative (positive times negative equals negative).

These four quadrants are shown in the following plot:

ggplot(data = df_mean_center_scatterplot, 
       aes(x = Density_z, y = SMA_z)) +
  geom_point(color = "gray") +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  # You can also add annotations to plots by using `annotate()`. The inputs are the kind of annotation; in this case "text", but it could be circles, arrows, rectangles, labels, and other things. For text, you need a label, and coordinates for the annotation.
  annotate("text", label = "Q1: Positive", x= 2000, y = 2500) +
  annotate("text", label = "Q4: Negative", x= 2000, y = -2500) +
  annotate("text", label = "Q2: Negative", x= -2000, y = 2500) +
  annotate("text", label = "Q3: Positive", x= -2000, y = -2500) +
  coord_equal()

We can take the products of \(z_i\) by \(\bar{z}_i\) for all \(i\) and add them: \[ \sum_{i=1}^n{z_i\bar{z}_i} = \sum_{i=1}^n{z_i\sum_{j=1}^n{w_{ij}^{st}z_j}} \]

If many dots are in Quadrants 1 and 3 in the scatterplot, the sum of the products will tend to be a large positive number. On the other hand, if many dots are in Quadrants 2 and 4, the sum of the products will tend to be a large number, but negative. Either case would be indicative of a pattern:

  1. If the sum is positive, this would suggest that high & high values tend to be together, while low & low values also tend to be together.

  2. In contrast, if the sum is negative, this would suggest that high values tend to be surrounded by low values, and vice-versa.

Finally, if the dots are scattered over the four quadrants, some products will be positive and some will be negative, and they will tend to cancel each other when summed. In this way, the sum of the products will tend to be closer to zero.
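The reasoning about quadrants and the sign of the sum can be checked by hand on a very small example. The sketch below is an illustration under assumptions: six zones on a toy ring (two neighbors each, weight 1/2), two hypothetical mean-centered variables, and helper functions `ring_sma()` and `quadrant()` that are not part of any package.

```r
# Spatial moving average on a ring with row-standardized weights
ring_sma <- function(v) {
  k <- length(v)
  (v[c(k, 1:(k - 1))] + v[c(2:k, 1)]) / 2
}

# Assign each observation to a quadrant of the scatterplot
quadrant <- function(z, sma) {
  ifelse(z > 0 & sma > 0, "Q1",
  ifelse(z < 0 & sma > 0, "Q2",
  ifelse(z < 0 & sma < 0, "Q3",
  ifelse(z > 0 & sma < 0, "Q4", "on an axis"))))
}

# Two mean-centered variables: similar values side by side vs. alternating values
z_clustered <- c(-3, -2, -1, 1, 2, 3)
z_alternating <- c(-3, 3, -2, 2, -1, 1)

table(quadrant(z_clustered, ring_sma(z_clustered)))
table(quadrant(z_alternating, ring_sma(z_alternating)))

# Sum of the products of the variable by its spatial moving average
sum_clustered <- sum(z_clustered * ring_sma(z_clustered))        # 6
sum_alternating <- sum(z_alternating * ring_sma(z_alternating))  # -25
```

In the first variable most observations fall in Quadrants 1 and 3 and the sum is positive; in the second all observations fall in Quadrants 2 and 4 and the sum is negative.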

23.8 Moran’s \(I\) and Moran’s Scatterplot

Based on the discussion above, let us define the following coefficient, called Moran’s I: \[ I = \frac{\sum_{i=1}^n{z_i\sum_{j=1}^n{w_{ij}^{st}z_j}}}{\sum_{i=1}^{n}{z_i^2}} \]

The numerator in this expression is the sum of the products described above. The denominator is the sum of the squared deviations of \(x_i\) from the mean (that is, the variance of \(x\) up to a constant factor), and is used here to scale Moran’s \(I\) so that it is contained roughly in the interval \((-1, 1)\) (the exact bounds depend on the characteristics of the zoning system).

Moran’s \(I\) is a coefficient of spatial autocorrelation.

We can calculate Moran’s \(I\) as follows, using as an example the mean-centered population density (notice how it is the sum of the products of \(z_i\) by their spatial moving averages \(\bar{z}_i\), divided by the sum of the squared \(z_i\)):

# Try to decipher the formula. You should be able to see that we are calculating the sum of the products of the variable by its spatial moving average, divided by the sum of squares of the variable
sum(df_mean_center_scatterplot$Density_z *  df_mean_center_scatterplot$SMA_z) / sum(df_mean_center_scatterplot$Density_z^2)
## [1] 0.5179736

Since the value is positive, and relatively high, this would suggest a non-random spatial pattern of similar values (i.e., high & high and low & low).
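There is a direct link between this coefficient and the fitted lines used earlier: for a mean-centered variable, Moran’s \(I\) is the slope of the line that regresses the spatial moving average on the variable. The following sketch verifies this numerically under assumptions made only for illustration (a toy ring of 100 zones with two neighbors each, a simulated variable, and the hypothetical helper `ring_sma()`).

```r
set.seed(42)                         # Arbitrary seed, for reproducibility
n <- 100

# Spatial moving average on a ring with row-standardized weights
ring_sma <- function(v) {
  k <- length(v)
  (v[c(k, 1:(k - 1))] + v[c(2:k, 1)]) / 2
}

x <- cumsum(rnorm(n))                # A simulated variable where neighbors tend to be similar
z <- x - mean(x)                     # Mean-centered variable
sma_z <- ring_sma(z)                 # Its spatial moving average

# Moran's I calculated manually, as in the formula
I_manual <- sum(z * sma_z) / sum(z^2)

# Slope of the fitted line, with the line forced through the origin and with an intercept
slope_origin <- unname(coef(lm(sma_z ~ z - 1)))
slope_intercept <- unname(coef(lm(sma_z ~ z))[2])

c(I = I_manual, slope_origin = slope_origin, slope_intercept = slope_intercept)
```

The three values coincide because the deviations \(z_i\) add up to zero, so including an intercept does not change the slope.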

Moran’s \(I\) is implemented in R in the spdep package, which makes its calculation easy, since you do not have to go manually through the process of calculating the spatial moving averages, etc.

The function moran() requires as input arguments a variable, a set of spatial weights, the number of zones (\(n\)), and the total sum of all weights (termed \(S_0\)) - which in the case of row-standardized spatial weights is equal to the number of zones. Therefore:

mc <- moran(Hamilton_CT$POP_DENSITY, Hamilton_CT.w, n = 188, S0 =  188)
mc$I
## [1] 0.5179736
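To see why the sum of all weights matters, it helps to write the coefficient in its general form, \(I = (n/S_0)\sum_i{z_i\sum_j{w_{ij}z_j}}/\sum_i{z_i^2}\), which reduces to the formula used above when \(S_0 = n\). The sketch below builds a small weights matrix by hand to check this; the ring of eight zones and the values of `x` are assumptions for illustration only.

```r
n <- 8                               # Number of zones in a toy ring

# Binary contiguity matrix: each zone is adjacent to the previous and the next zone
B <- matrix(0, n, n)
for (i in 1:n) {
  B[i, ifelse(i == 1, n, i - 1)] <- 1
  B[i, ifelse(i == n, 1, i + 1)] <- 1
}

W <- B / rowSums(B)                  # Row-standardized weights: every row adds up to one
S0_W <- sum(W)                       # Sum of all row-standardized weights
S0_B <- sum(B)                       # Sum of all binary weights

x <- c(1, 2, 4, 7, 9, 8, 5, 3)       # Hypothetical variable
z <- x - mean(x)

# General form of Moran's I with row-standardized and with binary weights
I_W <- (n / S0_W) * sum(z * drop(W %*% z)) / sum(z^2)
I_B <- (n / S0_B) * sum(z * drop(B %*% z)) / sum(z^2)

c(S0_W = S0_W, S0_B = S0_B, I_W = I_W, I_B = I_B)
```

With row-standardized weights the factor \(n/S_0\) equals one. In this toy example every zone has the same number of neighbors, so the binary weights lead to the same coefficient; that will not be true in general for irregular zoning systems.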

You can verify that this matches the value calculated above. The kind of scatterplots that we used previously are called Moran’s scatterplots, and they can also be created easily by means of the moran.plot() function of the spdep package:

# Confirming the results from the Moran coefficient above. We use "moran.plot" to illustrate the SMA of population density by census tract in Hamilton. 
mp <- moran.plot(Hamilton_CT$POP_DENSITY, Hamilton_CT.w)

23.9 Hypothesis Testing for Spatial Autocorrelation

The tools described so far are useful to suggest whether a pattern is random; however, while inspection of the scatterplot is suggestive, we would like a more formal criterion to decide whether the pattern is random. Fortunately, Moran’s \(I\) can be used to develop a test of hypothesis. The expected value of Moran’s \(I\) under the null hypothesis of spatial randomness (or independence), as well as its variance, have been derived.

A test for autocorrelation based on Moran’s \(I\) is implemented in the spdep package:

#"moran.test" is calculating spatial autocorrelation of population density in Hamilton census tracts
moran.test(Hamilton_CT$POP_DENSITY, Hamilton_CT.w)
## 
##  Moran I test under randomisation
## 
## data:  Hamilton_CT$POP_DENSITY  
## weights: Hamilton_CT.w    
## 
## Moran I statistic standard deviate = 12.722, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.517973553      -0.005347594       0.001691977

Since the null hypothesis is of spatial independence, the \(p\)-value of the statistic is interpreted as the probability of obtaining a value of Moran’s \(I\) at least as large as the one observed if the pattern were in fact random; informally, it is the risk of making a mistake by rejecting the null hypothesis. In the present case, the \(p\)-value is such a small number that we can reject the null hypothesis with a high degree of confidence.
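The numbers in the output of the test can be connected with a few lines of arithmetic. The sketch below copies the statistic and the variance from the output above, uses the fact that the expected value of Moran’s \(I\) under the null hypothesis is \(-1/(n-1)\), and assumes that the standard deviate is compared to a standard normal distribution with a one-sided ("greater") alternative.

```r
I_obs <- 0.517973553                 # Moran's I statistic, copied from the output
Var_I <- 0.001691977                 # Variance under the null hypothesis, copied from the output
n <- 188                             # Number of census tracts

E_I <- -1 / (n - 1)                  # Expected value of Moran's I under the null hypothesis

# Standard deviate: how many standard deviations the statistic is from its expectation
z_score <- (I_obs - E_I) / sqrt(Var_I)

# One-sided p-value from the standard normal distribution
p_value <- pnorm(z_score, lower.tail = FALSE)

c(expectation = E_I, standard_deviate = z_score, p_value = p_value)
```

The expectation and the standard deviate should reproduce the values reported by the test (approximately -0.0053 and 12.72).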

Moran’s \(I\) and Moran’s scatterplots are among the most widely used tools in the analysis of spatial area data.

24 Activity 11: Area Data III

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

24.1 Practice questions

Answer the following questions:

  1. What does the 45 degree line in the scatterplot of spatial moving averages indicate?
  2. What is the effect of centering a variable around the mean?
  3. In your own words, describe the phenomenon of spatial autocorrelation.
  4. What is the null hypothesis in the test of autocorrelation based on Moran’s I?

24.2 Learning objectives

In this activity, you will:

  1. Calculate Moran’s I coefficient of autocorrelation for area data.
  2. Create Moran’s scatterplots.
  3. Examine the results of the tests/scatterplots for further insights.
  4. Think about ways to decide whether a landscape is random when working with area data.

24.3 Suggested reading

O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 7. John Wiley & Sons: New Jersey.

24.4 Preliminaries

It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently in the workspace.

Load the libraries you will use in this activity.

In addition to tidyverse, you will need sf, a package that implements simple features in R (you can learn about sf here) and spdep, a package that implements several spatial statistical methods (you can learn more about it here):

library(isdas)
library(sf)
library(spdep)
library(tidyverse)

Begin by loading the data that you will use in this activity:

data(Hamilton_CT)

This is an sf object with census tracts and selected demographic variables for the Hamilton CMA in Canada.

You can obtain new (calculated) variables as follows. For instance, to obtain the proportion of residents who are between 20 and 34 years old, and between 35 and 49:

Hamilton_CT <- mutate(Hamilton_CT, Prop20to34 = (AGE_20_TO_24 + AGE_25_TO_29 + AGE_30_TO_34)/POPULATION, Prop35to49 = (AGE_35_TO_39 + AGE_40_TO_44 + AGE_45_TO_49)/POPULATION)

You are now ready for the next activity.

24.5 Activity

NOTE: Activities include technical “how to” tasks/questions. Usually, these ask you to practice using the software to organize data, create plots, and so on in support of analysis and interpretation. The second type of questions ask you to activate your brainware and to think geographically and statistically.

Activity Part I

  1. Create a spatial weights matrix for the census tracts in the Hamilton CMA.

  2. Use moran.test() to test the following variables for spatial autocorrelation: proportion of the population who are 20 to 34 years old, 35 to 49 years old, 50 to 64 years old, and 65 and older.

  3. Use moran.plot() to create Moran’s scatterplots to complement your tests of spatial autocorrelation.

Activity Part II

  1. How confident are you deciding whether the variables under analysis are not spatially random? What can you say regarding the relative strength of the spatial pattern of these variables?

  2. Show a fellow student the Moran’s scatterplots you created in point 3. What can you tell about the spatial pattern based on these scatterplots? Create choropleth maps for the variables. If the spatial pattern is not random, what kind of process might have led to the patterns you observe?

  3. The scatterplots created using moran.plot include some observations that are labeled with their id and a different symbol. Why do you think these observations are highlighted in such a way?

25 Area Data IV

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

25.1 Learning objectives

In the previous practice/session, you learned about the concept of spatial autocorrelation, and how it can be used to evaluate statistical maps when searching for patterns. We also introduced Moran’s \(I\) coefficient, one of the most widely used tools to measure spatial autocorrelation.

In this practice, you will learn about:

  1. Decomposing Moran’s \(I\).
  2. Local Moran’s \(I\) and mapping.
  3. A concentration approach for local analysis of spatial association.
  4. A short note on hypothesis testing.
  5. Detection of hot and cold spots.

25.2 Suggested readings

  • Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapter 7. Longman: Essex.
  • Bivand RS, Pebesma E, and Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 9. Springer: New York.
  • Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 7. Sage: Los Angeles.
  • O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 7. John Wiley & Sons: New Jersey.

25.3 Preliminaries

As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently in the workspace.

Load the libraries you will use in this activity:

library(crosstalk)
library(isdas)
library(plotly)
library(sf)
library(spdep)
library(tidyverse)

Load the datasets:

data("df1_simulated")
data("df2_simulated")

These two dataframes are simulated landscapes, one completely random and another stochastic with a strong systematic pattern. Note that the descriptive statistics of both variables are identical:

summary(df1_simulated)
##        x               y               z        
##  Min.   : 1.00   Min.   : 1.00   Min.   :24.40  
##  1st Qu.:27.00   1st Qu.:19.00   1st Qu.:27.89  
##  Median :46.50   Median :33.00   Median :30.33  
##  Mean   :45.61   Mean   :31.63   Mean   :34.38  
##  3rd Qu.:66.00   3rd Qu.:45.00   3rd Qu.:38.25  
##  Max.   :87.00   Max.   :61.00   Max.   :69.59
summary(df2_simulated)
##        x               y               z        
##  Min.   : 1.00   Min.   : 1.00   Min.   :24.40  
##  1st Qu.:27.00   1st Qu.:19.00   1st Qu.:27.89  
##  Median :46.50   Median :33.00   Median :30.33  
##  Mean   :45.61   Mean   :31.63   Mean   :34.38  
##  3rd Qu.:66.00   3rd Qu.:45.00   3rd Qu.:38.25  
##  Max.   :87.00   Max.   :61.00   Max.   :69.59

The third dataset is an object of class sf (simple feature) with the census tracts of Hamilton CMA and some selected population variables from the 2011 Census of Canada:

data(Hamilton_CT)

25.4 Decomposing Moran’s \(I\)

Here we will revisit Moran’s \(I\) coefficient to see how its utility for the exploration of spatial patterns can be extended. Recall from the preceding reading and activity that this coefficient of spatial autocorrelation was derived based on the idea of aggregating the products of a (mean-centered) variable by its spatial moving average, and then dividing by the variance: \[ I = \frac{\sum_{i=1}^n{z_i\sum_{j=1}^n{w_{ij}^{st}z_j}}}{\sum_{i=1}^{n}{z_i^2}} \]

Also, remember that when plotting Moran’s scatterplot using moran.plot() some observations were highlighted. To see this, we will recreate the plot, for which we need a set of spatial weights:

Hamilton_CT.w <- nb2listw(poly2nb(pl = Hamilton_CT))

And here is the scatterplot of population density again:

# We can use the arguments xlab and ylab in `moran.plot()` to change the labels for the two axes of the plot
mp <- moran.plot(Hamilton_CT$POP_DENSITY, Hamilton_CT.w, xlab = "Population Density", ylab = "Lagged Population Density")

The reason some observations are highlighted is because they have been identified as “influential”, meaning that they make a particularly large contribution to the calculation of \(I\). It turns out that the relative contribution of each observation to the calculation of Moran’s \(I\) is informative in and of itself, and its analysis can provide more focused information about the spatial pattern.

To explore this, we will recreate the scatterplot manually to have better control of its aspect. To do this, we first create a dataframe with the mean-centered and scaled variable \(z_i=(x_i-\bar{x})/s^2\), where \(s^2\) is the variance of \(x\), and its spatial moving average. We will also create a factor variable (call it Type) to identify the type of spatial relationship (Low & Low, if both \(z_i\) and its spatial moving average are negative, High & High, if both \(z_i\) and its spatial moving average are positive, and Low & High/High & Low otherwise). This information is useful for mapping the results:

Hamilton_CT <- Hamilton_CT %>% # Use the pipe operator to pass the dataframe as an argument to `mutate()`, which is used to create new variables.
  mutate(Z = (POP_DENSITY - mean(POP_DENSITY)) / var(POP_DENSITY), # Create a mean-centered variable that is standardized by the variance.
         SMA = lag.listw(Hamilton_CT.w, Z), # Calculate the spatial moving average of variable `Z`.
         # The function `case_when()` is used to evaluate several logical conditions and respond to them. 
         Type = case_when(Z < 0 & SMA < 0 ~ "LL",
                          Z > 0 & SMA > 0 ~ "HH",
                          TRUE ~ "HL/LH"))

Next, we will create the scatterplot and a choropleth map of the population density. The package plotly is used to create interactive plots. Read more about how to visualize geospatial information with plotly here. The package crosstalk allows us to link two plots for brushing (brushing is a visualization technique that links several plots in a dynamic way to highlight some elements of interest).

To create an interactive plot for linking and brushing, we first create a SharedData object to link two plots:

# Create a shared data object for brushing.
df_msc.sd <- SharedData$new(Hamilton_CT)

The function bscols() (for bootstrap columns) is used to array two plotly objects; the first of these is a scatterplot, and the second is a choropleth map of population density.

bscols(
  # The first plot is Moran's scatterplot
  plot_ly(df_msc.sd) %>% # Create a `plotly` object using the dataframe as an input. The pipe operator passes this object to the function `add_markers()`; this function is similar to the `geom_point()` function in `ggplot2` and it draws objects on the blank plot created by `plot_ly()`
    add_markers(x = ~Z, y = ~SMA, color = ~POP_DENSITY, size = ~(Z * SMA), colors = "YlOrRd") %>%
    hide_colorbar() %>%     # Remove the colorbar from the plot.
    highlight("plotly_selected"), # Highlight observations when selected.
  # The second plot is a choropleth map
  plot_ly(df_msc.sd) %>% # Create a `plotly` object using the dataframe as an input. The pipe operator passes this object to the function `add_sf()`; this function is similar to the `geom_sf()` functions in `ggplot2` and it draws a simple features object on the blank plot created by `plot_ly()`
    add_sf(split = ~TRACT, color = ~POP_DENSITY, colors = "YlOrRd", showlegend = FALSE) %>%
    hide_colorbar() %>% # Remove colorbar from the plot.
    highlight(dynamic = TRUE) # Highlight observations when selected.
)

The darker colors are zones with higher population densities. The size of the dots in the scatterplot indicates the contribution of each zone to Moran’s \(I\). Since the plots are linked for brushing, it is possible to select groups of dots in the scatterplot (double click to clear a selection). Change the color for brushing to select a different group of dots. Can you identify in the map the zones that contribute the most to Moran’s \(I\)?

The direct relationship between the dots in the scatterplot and the values of the variable in the map suggests the following decomposition of Moran’s \(I\).

25.5 Local Moran’s \(I\) and Mapping

A possible decomposition of Moran’s \(I\) into local components is as follows (see Anselin 1995) (Available here): \[ I_i = \frac{z_i}{m_2}\sum_{j=1}^n{w_{ij}^{st}z_j} \] where \(z_i\) is a mean-centered variable, and: \[ m_2 = \sum_{i=1}^n{z_i^2} \] is its sum of squares (which is proportional to its variance). \(I_i\) is called local Moran’s \(I\). It is straightforward to see that: \[ I = \sum_{i=1}^n{I_i} \]

In other words, the coefficients \(I_i\) when summed equal \(I\). To distinguish between these, we will call our Moran’s \(I\) coefficient a global statistic: there is one value for a map and it describes overall autocorrelation. \(I_i\), in turn, we will call a local statistic: it can be calculated locally for a location of interest, and describes autocorrelation for that location, as well as the contribution of that location to the global statistic.
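The decomposition can be verified on a small example. The sketch below is an illustration under assumptions: six zones on a toy ring (two neighbors each, weight 1/2), a hypothetical mean-centered variable, and the helper `ring_sma()`. It also shows an alternative scaling, dividing \(m_2\) by the number of zones, under which the local coefficients average (rather than add up) to the global coefficient.

```r
# Spatial moving average on a ring with row-standardized weights
ring_sma <- function(v) {
  k <- length(v)
  (v[c(k, 1:(k - 1))] + v[c(2:k, 1)]) / 2
}

z <- c(-3, -2, -1, 1, 2, 3)          # Hypothetical mean-centered variable
n <- length(z)
sma_z <- ring_sma(z)

# Global Moran's I
I_global <- sum(z * sma_z) / sum(z^2)

# Local coefficients with m2 defined as the sum of squares: they add up to I
I_local <- z * sma_z / sum(z^2)

# Local coefficients with m2 defined as the sum of squares divided by n: their mean is I
I_local_scaled <- z * sma_z / (sum(z^2) / n)

c(I = I_global, sum_local = sum(I_local), mean_local_scaled = mean(I_local_scaled))
```

The zones with the largest local coefficients (here the second and fifth) are those where the value and its spatial moving average are both far from the mean and have the same sign.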

An advantage of the local decomposition described here is that it allows an analyst to map the statistic to better understand the spatial pattern. The local version of Moran’s \(I\) is implemented in spdep as localmoran(), and can be called with a variable and a set of spatial weights as arguments:

POP_DENSITY.lm <- localmoran(Hamilton_CT$POP_DENSITY, Hamilton_CT.w)

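The value (output) of the function is a matrix with local Moran’s \(I\) coefficients (i.e., \(I_i\)), their corresponding expected values and variances, and the standardized statistics and \(p\)-values used for hypothesis testing (more on this next). Notice that in this output it is the mean of the local coefficients (0.51797), rather than their sum, that matches the global Moran’s \(I\) calculated before; the two versions differ only by a factor equal to the number of zones. You can check the summary to verify the contents: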

summary(POP_DENSITY.lm)
##        Ii                E.Ii                Var.Ii              Z.Ii          Pr(z != E(Ii))   
##  Min.   :-0.62144   Min.   :-0.1506849   Min.   :0.000004   Min.   :-2.36042   Min.   :0.00000  
##  1st Qu.: 0.00478   1st Qu.:-0.0054846   1st Qu.:0.012609   1st Qu.: 0.08777   1st Qu.:0.07346  
##  Median : 0.12523   Median :-0.0017893   Median :0.048380   Median : 0.86588   Median :0.31535  
##  Mean   : 0.51797   Mean   :-0.0053476   Mean   :0.169109   Mean   : 1.00476   Mean   :0.36541  
##  3rd Qu.: 0.59384   3rd Qu.:-0.0004661   3rd Qu.:0.159707   3rd Qu.: 1.77619   3rd Qu.:0.61867  
##  Max.   : 8.30454   Max.   :-0.0000002   Max.   :4.708525   Max.   : 5.83338   Max.   :0.99755

Rename the columns for convenience:

colnames(POP_DENSITY.lm) <- c("Ii", "E.Ii", "Var.Ii", "Z.Ii", "p.val")

Similar to the global version of Moran’s \(I\), hypothesis testing can be conducted by comparing the empirical statistic to its distribution under the null hypothesis of spatial independence. The function localmoran reports p-values to this end.

For further exploration, join the local statistics to the dataframe:

Hamilton_CT <- Hamilton_CT %>% 
  left_join(data.frame(TRACT = Hamilton_CT$TRACT, 
                       POP_DENSITY.lm), 
            by = "TRACT") # Join the results of `localmoran()` to the dataframe

Now it is possible to map the local statistics. Since we added the \(p\)-value of the local statistics, we can distinguish between those with small (say, less than 0.05) and large \(p\)-values:

# The function `add_sf()` draws a simple features object, similar to `geom_sf()` in `ggplot2`. We "split" observations based on their p-values: if the p-value is less than 0.05, the condition is "TRUE" and otherwise it is "FALSE". Finally, we color the zones based on their `Type`: that is, whether they are High & High according to the local statistic, or Low & Low, etc.
  plot_ly(Hamilton_CT) %>%
    add_sf(type = "scatter",
           split = ~(p.val < 0.05), 
           color = ~Type, 
           colors = c("red", 
                      "khaki1", 
                      "dodgerblue", 
                      "dodgerblue4")) 

The map above shows whether population density in a zone is high, surrounded by other zones with high population densities (HH), or low, surrounded by zones that also have low population density (LL). Other zones have either low population densities and are surrounded by zones with high population density, or vice-versa (HL/LH).

Click on the legend to filter by category of TRUE-FALSE and HH-LL-HL/LH.

This map allows you to identify what we could call the downtown core (from the perspective of population density), and the most suburban-rural census tracts in the Hamilton CMA.

While mapping \(I_i\) or their corresponding \(p\)-values is straightforward, I personally find it more useful to map whether the zones are of type HH, LL, or HL/LH. Since such maps are not (to the best of my knowledge) the output of an existing function in an R package, we will create one here.

# A function is a way of packaging a set of standard instructions. Here, we package all the steps we used above to create the map of the local Moran coefficients in a new function called `localmoran.map()`
localmoran.map <- function(p, listw, VAR, by){
  # p is a simple features object
  require(tidyverse)
  require(spdep)
  require(plotly)
  
  df_msc <- p %>% 
    rename(VAR = as.name(VAR),
              key = as.name(by)) %>%
    transmute(key,
              VAR,
              Z = (VAR - mean(VAR)) / var(VAR),
              SMA = lag.listw(listw, Z),
              Type = case_when(Z < 0 & SMA < 0 ~ "LL",
                               Z > 0 & SMA > 0 ~ "HH",
                               TRUE ~ "HL/LH"))
  
  local_I <- localmoran(df_msc$VAR, listw)
  
  colnames(local_I) <- c("Ii", "E.Ii", "Var.Ii", "Z.Ii", "p.val")
  
  df_msc <- left_join(df_msc, 
                      data.frame(key = df_msc$key, 
                                 local_I),
                      by = "key")
  
  plot_ly(df_msc) %>%
    add_sf(type = "scatter",
           split = ~(p.val < 0.05), 
           color = ~Type, 
           colors = c("red", 
                      "khaki1",
                      "dodgerblue", 
                      "dodgerblue4")) 
}

Notice how this function simply replicates the steps that we followed earlier to create the map with the results of the local Moran coefficients.

To use this function you need as inputs an object of class sf, a listw object with spatial weights, and to define the variable of interest and a unique identifier for the areas (such as their tract identifiers). For example:

localmoran.map(Hamilton_CT, Hamilton_CT.w, 
               "POP_DENSITY", 
               by = "TRACT")

There, the function creates the map as desired.

25.6 A Quick Note on Functions

Once you know the steps needed to complete a task, if the task needs to be repeated many times, possibly using different inputs, a function is a way of packing those instructions in a convenient way. That is all.
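As a minimal illustration of this idea, the sketch below packs the steps used to classify zones (center the variable, calculate its spatial moving average, assign a type) into one function, and then calls it with two different inputs. The function `classify_type()`, the helper `ring_lag()` (a toy ring where each zone has two neighbors with weight 1/2), and the data are all hypothetical.

```r
# A toy spatial lag: each zone on a ring has two neighbors with weight 1/2
ring_lag <- function(v) {
  k <- length(v)
  (v[c(k, 1:(k - 1))] + v[c(2:k, 1)]) / 2
}

# Pack the steps in a function: x is a variable, lag_fun calculates spatial moving averages
classify_type <- function(x, lag_fun) {
  z <- x - mean(x)                   # Step 1: center the variable on its mean
  sma <- lag_fun(z)                  # Step 2: calculate the spatial moving average
  # Step 3: classify according to the signs of the variable and its moving average
  ifelse(z < 0 & sma < 0, "LL",
         ifelse(z > 0 & sma > 0, "HH", "HL/LH"))
}

# The same instructions are reused with different inputs
x1 <- c(1, 2, 3, 5, 6, 7)            # Similar values are side by side
x2 <- c(1, 7, 2, 6, 3, 5)            # High and low values alternate

classify_type(x1, ring_lag)
classify_type(x2, ring_lag)
```

Passing the lag calculation as an argument is a design choice made here so that the same function could be reused with other definitions of neighborhood.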

25.7 A Concentration approach for Local Analysis of Spatial Association

The local version of Moran’s \(I\) is one of the most widely used tools of a family of measures called Local Indicators of Spatial Association or LISA. It is not the only one, however.

In this section, we will see an alternative way of exploring spatial patterns locally, by means of a concentration approach.

To introduce this new approach, imagine a landscape with a variable that can be measured on a ratio scale with a true zero point (say, population, income, a contaminant, or property values: variables that do not take negative values and for which a value of zero indicates complete absence).

Imagine that you stand at a given location on that landscape and survey your surroundings. If your surroundings look very similar to the location where you stand (i.e., if their elevation is similar, relative to the rest of the landscape), you would take that as evidence of a spatial pattern, at least locally. This is the fundamental idea behind spatial autocorrelation analysis.

As an alternative, imagine for instance that the variable of interest is, say, personal income. You might ask “how much of the regional wealth can be found in my neighborhood?” (or, if you prefer, imagine that the variable is a contaminant, and your question is, how much of it is around here?)

Imagine now that personal income is spatially random. What would you expect the share of the wealth to be in your neighborhood? Would that share change if you moved to any other location?

Let’s elaborate on this thought experiment. Take the df1_simulated dataframe. The total sum of variable z in the region is 12,034.34. See:

sum(df1_simulated$z)
## [1] 12034.34

The following is an interactive plot of variable z in the sample dataframe df1. This variable is spatially random:

# Define how variables in the table are represented in the plot: for instance, the variable `x` corresponds to the x axis. Next, define the properties of the markers, or geometric objects in the plot. For example, their color will be proportional to variable `z` 
plot_ly(df1_simulated, 
        x = ~x, 
        y = ~y, 
        z = ~z, 
        marker = list(color = ~z, 
                      colorscale = c('#FFE1A1', 
                                     '#683531'), 
                      showscale = TRUE)) %>%
  add_markers()

Imagine that you stand at coordinates x = 53 and y = 34 (we will call this location the focal point), and you survey the landscape within a radius \(r\) of 10 (units of distance) of this location. How much wealth is concentrated in the neighborhood of the focal point? Let’s see:

# Define the focal point
xy0 <- c(53, 34)
# Select a radius
r <- 10
# Extract observations that are within a radius of `r` from focal point `xy0` (note that sqrt((x - xy0[1])^2 + (y - xy0[2])^2) is Pythagoras's formula for calculating the distance between two points; if this distance is less than `r`, the point is kept)
df1_simulated %>% 
  subset(sqrt((x - xy0[1])^2 + (y - xy0[2])^2) < r) %>%
  select(z) %>% 
  sum()
## [1] 832.0156

Here, we calculated how much of the variable is present locally around the focal point. Recall that the total of the variable for the region is 12,034.34.

If you change the radius r to a very large number, the concentration of the variable will simply become the total sum of the variable for the region. Essentially, the whole region is the “neighborhood” of the focal point. Try it.

Now, for a fixed radius, change the focal point, and see how much the concentration of the variable changes for its neighborhood. How does the concentration of the variable vary by focal point?
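
Since this calculation is repeated for many focal points, it is convenient to pack it into a function. The following is a minimal sketch: the function name local_total() and the small simulated dataframe toy are assumptions made for illustration, and are not part of the datasets used in this chapter.

```r
# Toy landscape (an assumption for illustration; not the course data)
set.seed(123)
toy <- data.frame(x = runif(200, 0, 100),
                  y = runif(200, 0, 100),
                  z = runif(200, 0, 70))

# Hypothetical helper: total of `z` within radius `r` of focal point `xy0`
local_total <- function(df, xy0, r){
  # Euclidean distance from every observation to the focal point
  d <- sqrt((df$x - xy0[1])^2 + (df$y - xy0[2])^2)
  # Sum the variable for the observations inside the circle
  sum(df$z[d < r])
}

# Concentration around several focal points, as a share of the regional total
focal_points <- list(c(53, 34), c(20, 80), c(75, 60))
shares <- sapply(focal_points,
                 function(xy0) local_total(toy, xy0, r = 10) / sum(toy$z))
shares

# With a very large radius the neighborhood is the whole region
local_total(toy, c(53, 34), r = 1000)
sum(toy$z)
```

When the variable is spatially random, the shares should be broadly similar from one focal point to the next, apart from differences in the number of observations that fall inside each circle.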

We will now repeat the thought experiment, this time with the landscape shown in the following figure:

plot_ly(df2_simulated, 
        x = ~x, 
        y = ~y, 
        z = ~z,
        marker = list(color = ~z,
                      colorscale = c('#FFE1A1', 
                                     '#683531'), 
                      showscale = TRUE)) %>%
  add_markers()

Imagine that you stand at the focal point with coordinates x = 53 and y = 34. Can you identify the point in the plot? If you surveyed the neighborhood, what would be the concentration of wealth there? How would that change as you visited different focal points? Let’s see (again, recall that the total of the variable for the whole region is 12,034.34):

xy0 <- c(53, 34)
# Select a radius
r <- 10
# Extract observations that are within a radius of `r` from focal point `xy0` (note that sqrt((x - xy0[1])^2 + (y - xy0[2])^2) is Pythagoras's formula for calculating the distance between two points; if this distance is less than `r`, the point is kept)
df2_simulated %>% 
  subset(sqrt((x - xy0[1])^2 + (y - xy0[2])^2) < r) %>%
  select(z) %>% 
  sum()
## [1] 1316.884

Change the focal point. How does the concentration of the variable change?

We are now ready to define the following measure of local concentration (see Getis and Ord, 1992): \[ G_i^*(d)=\frac{\sum_{j=1}^n{w_{ij}x_j}}{\sum_{i=1}^{n}x_{i}} \]

Notice that the spatial weights are not row-standardized, and in fact must be a binary variable as follows: \[ w_{ij}=\bigg\{\begin{array}{l l} 1\text{ if } d_{ij}\leq d\\ 0\text{ otherwise}\\ \end{array} \]

This is because in this measure of concentration, we do not calculate the spatial moving average for the neighborhood, but the total of the variable in the neighborhood.

A variant of this statistic removes from the sum the value of the variable at \(i\): \[ G_i(d)=\frac{\sum_{j\neq i}^n{w_{ij}x_j}}{\sum_{i=1}^{n}x_{i}} \]

I do not find this definition to be particularly useful. I suspect it was defined to resemble Moran’s \(I\), where an area is not its own neighbor - which is reasonable from an autocorrelation perspective (an area is perfectly autocorrelated with itself). In a concentration approach, not using the value at \(i\) is less appealing.
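
The two definitions can be compared with a small worked example. This is only a sketch with assumed toy values (five points on a line, one unit of distance apart), and it follows the formulas exactly as written above, with the regional total in the denominator of both statistics.

```r
# Toy data (assumed for illustration): five points on a line, one unit apart
xy <- cbind(x = 1:5, y = rep(0, 5))
z <- c(10, 20, 30, 40, 100)

# Binary weights: 1 if the distance between i and j is at most d, 0 otherwise
d <- 1
D <- as.matrix(dist(xy))
W_star <- (D <= d) * 1   # includes i itself, since the distance from i to i is 0
W <- W_star
diag(W) <- 0             # excludes i from its own neighborhood

# Share of the regional total found in each neighborhood
G_star <- as.vector(W_star %*% z) / sum(z)
G <- as.vector(W %*% z) / sum(z)

round(cbind(G_star, G), 3)
```

The difference between the two columns is simply the share of the total found at \(i\) itself. Notice how the last observation, which has the largest value, looks unremarkable under \(G_i(d)\) because its own value is ignored.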

As with the local version of Moran’s \(I\), it is possible to map the statistic to better understand the spatial pattern.

The \(G_i^*(d)\) and \(G_i(d)\) statistics are implemented in spdep as localG, and can be called with a variable and a set of spatial weights as arguments.

We will calculate this statistic for the two datasets in the example above. This requires that we create binary spatial weights. Begin by creating neighbors by distance:

# Create a matrix of coordinates.
xy_coord <- cbind(df1_simulated$x, df1_simulated$y)
# Find all neighbors that are within 0 and 10 units of distance from every observation.
dn10 <- dnearneigh(xy_coord, 0, 10)

There are two differences with the procedure that we used before to create spatial weights. First, when we created spatial weights for Moran’s \(I\) coefficient, we stated that an observation is not its own neighbor. For the concentration approach, we might prefer to say that an observation is in the neighborhood of interest (being at its center). For this reason, we might opt to include the observation at \(i\) (therefore include.self()). And secondly, the style of the matrix is now “B” (for binary):

# Convert the nearest neighbors `nb` object to spatial weights
wb10 <- nb2listw(include.self(dn10), style = "B")
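
To see why the style of the weights matters, the following sketch contrasts binary and row-standardized weights using plain matrices. The values are assumed toy data, and the matrices are built by hand rather than with spdep, so this illustrates the idea only.

```r
# Toy data (assumed): four points on a line, one unit of distance apart
xy <- cbind(1:4, rep(0, 4))
z <- c(2, 4, 12, 6)

# Binary weights within one unit of distance, including the observation itself
W_binary <- (as.matrix(dist(xy)) <= 1) * 1

# Row-standardized weights: each row is divided by its sum, so rows add to one
W_row <- W_binary / rowSums(W_binary)

# Binary weights give the total of the variable in each neighborhood
totals <- as.vector(W_binary %*% z)
# Row-standardized weights give the mean of the variable in each neighborhood
means <- as.vector(W_row %*% z)

cbind(totals, means)
```

A concentration statistic needs the totals, which is why the binary style is used here.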

The local statistics can be obtained as follows:

# The arguments of this function are a spatial variable and a list of spatial weights
df1.lg <- localG(df1_simulated$z, wb10)

The value (output) of the function is a vector of class localG with normalized local statistics. Normalized means that the mean under the null hypothesis has been subtracted and the result has been divided by the standard deviation (the square root of the variance) under the null. Normalized statistics can be compared to the standard normal distribution for hypothesis testing. You can check the summary to verify the contents:

summary(df1.lg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.6345 -0.5085  0.1401  0.0657  0.5911  2.6638
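
To make the normalization concrete, the following sketch standardizes the neighborhood totals by hand, using a commonly used form of the standardized \(G_i^*\): under the null hypothesis the expected neighborhood total is \(W_i\bar{x}\), where \(W_i\) is the sum of the weights for \(i\). The helper gstar_z() and the toy data are assumptions for illustration; this is not the code used by spdep, and small differences with localG() are possible.

```r
# Hypothetical helper: standardized G_i^* for a variable `z` and a binary
# weights matrix `W` that includes each observation in its own neighborhood
gstar_z <- function(z, W){
  n <- length(z)
  Wi <- rowSums(W)      # sum of the weights in each neighborhood
  S1i <- rowSums(W^2)   # sum of squared weights (equal to Wi if binary)
  zbar <- mean(z)
  s <- sqrt(mean(z^2) - zbar^2)
  # Neighborhood total, minus its mean under the null, over its standard deviation
  (as.vector(W %*% z) - Wi * zbar) / (s * sqrt((n * S1i - Wi^2) / (n - 1)))
}

# Toy data (assumed): 100 points with a spatially random, non-negative variable
set.seed(42)
n <- 100
xy <- cbind(runif(n, 0, 100), runif(n, 0, 100))
z <- runif(n, 0, 70)
# Binary weights within 20 units of distance, including self
W <- (as.matrix(dist(xy)) <= 20) * 1

summary(gstar_z(z, W))
```

Because the statistic is standardized, adding a constant to the variable or changing its units does not change the result.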

The function localG() does not report the \(p\)-values, but they are relatively easy to calculate:

df1.lg <- as.numeric(df1.lg)
df1.lg <- data.frame(Gstar = df1.lg, p.val = 2 * pnorm(abs(df1.lg), lower.tail = FALSE))

How many of the \(p\)-values are less than the conventional decision cutoff of 0.05?
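
As a reminder of how the conversion works, here is a short sketch with a handful of assumed values of the normalized statistic (they are not taken from the analysis above):

```r
# Normalized statistics chosen for illustration (assumed values)
z_scores <- c(1.96, -1.96, 0.5, 3)

# Two-sided p-value: probability of a value at least as extreme in either tail
p <- 2 * pnorm(abs(z_scores), lower.tail = FALSE)
round(p, 4)

# Which ones fall below the conventional cutoff?
p < 0.05
```

A statistic of about 1.96 in absolute value sits right at the 0.05 cutoff; the sign does not matter for the \(p\)-value, only for deciding whether the concentration is high or low.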

Now the second example:

df2.lg <- localG(df2_simulated$z, wb10)
summary(df2.lg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -4.2400 -2.6791 -1.3999  0.1503  2.3938 12.2401

Adding \(p\)-values:

df2.lg <- as.numeric(df2.lg)
df2.lg <- data.frame(Gstar = df2.lg, p.val = 2 * pnorm(abs(df2.lg), lower.tail = FALSE))

If we bind the results of the \(G_i^*(d)\) analysis to the dataframe, we can plot the results for further exploration. We will classify the results by their type, in this case high and low concentrations:

df2 <- cbind(df2_simulated[,1:3],df2.lg)
df2 <- df2 %>%
  mutate(Type = case_when(Gstar < 0 & p.val <= 0.05 ~ "Low Concentration",
                          Gstar > 0 & p.val <= 0.05 ~ "High Concentration", 
                          TRUE ~ "Not Significant"))

And then the plot, but now color the points depending on whether they are high or low concentrations, and whether their \(p\)-values are lower than 0.05:

plot_ly(df2, 
        x = ~x,
        y = ~y, 
        z = ~z, 
        color = ~Type, 
        colors = c("red", 
                   "blue", 
                   "gray"),
        marker = list()) %>%
  add_markers()

What kind of pattern do you observe?

25.8 A Short Note on Hypothesis Testing

Local tests as introduced above are affected by an issue called multiple testing. Typically, when attempting to assess the significance of a statistic, a level of significance is adopted (conventionally 0.05). When working with local statistics, we typically conduct many tests of hypothesis simultaneously (in the example above, one for each observation).

A risk when conducting a large number of tests is that some of them might appear significant purely by chance! The more tests we conduct, the more likely that at least a few of them will appear to be significant by chance. For instance, in the preceding example the variable in df1 was spatially random, and yet a few observations had p-values smaller than 0.05.

What this suggests is that some correction to the level of significance used is needed.

A crude rule to make this adjustment is called a Bonferroni correction. This correction is as follows: \[ \alpha_B = \frac{\alpha_{nominal}}{m} \] where \(\alpha_{nominal}\) is the nominal level of significance, \(\alpha_B\) is the adjusted level of significance, and \(m\) is the number of simultaneous tests. This correction requires that each test be evaluated at a lower level of significance \(\alpha_B\) in order to achieve a nominal level of significance of 0.05 for the whole set of tests.
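
A small simulation can help to see the issue and the effect of the correction. This is a sketch under assumptions: the statistics are drawn from a standard normal distribution (so the null hypothesis is true for every test), and the number of tests is set to 350 for illustration.

```r
set.seed(2021)
m <- 350                # number of simultaneous tests (assumed)
z_null <- rnorm(m)      # statistics when the null hypothesis is true
p <- 2 * pnorm(abs(z_null), lower.tail = FALSE)

alpha <- 0.05
alpha_B <- alpha / m
alpha_B

# False positives at the nominal level (about 5% of m are expected)
sum(p <= alpha)
# False positives at the Bonferroni-adjusted level
sum(p <= alpha_B)
```

Every test flagged at the nominal level in this simulation is a false positive, since there is no pattern to detect. The adjusted level can never flag more tests than the nominal one.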

If we apply this correction to the analysis above, we see that instead of 0.05, the p-value needed for significance is much lower:

alpha_B <- 0.05/nrow(df1_simulated)
alpha_B
## [1] 0.0001428571

You can verify now that no observations in df1 show up as significant:

sum(df1.lg$p.val <= alpha_B)
## [1] 0

If we examine the variable in df2:

df2 <- mutate(df2, 
              Type = case_when(Gstar < 0 & p.val <= alpha_B ~ "Low Concentration",
                               Gstar > 0 & p.val <= alpha_B ~ "High Concentration",
                               TRUE ~ "Not Significant"),
              factor = Type)

plot_ly(df2, 
        x = ~x, 
        y = ~y, 
        z = ~z, 
        color = ~Type, 
        colors = c("red", 
                   "blue",
                   "gray"),
        marker = list()) %>%
  add_markers()

You will see that fewer observations are significant, but it is still possible to detect two regions of high concentration, and two of low concentration.

The Bonferroni correction is known to be overly strict, and sharper approaches exist to correct for multiple testing. Between the nominal level of significance (no correction) and the level of significance with Bonferroni correction, it is still possible to assess the gravity of the issue of multiple comparisons. Observations that are flagged as significant with the Bonferroni correction will also be significant under more refined corrections, so it provides the most conservative decision rule.
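
Base R includes the function p.adjust(), which implements several of these corrections by adjusting the \(p\)-values instead of the level of significance. The sketch below uses simulated \(p\)-values (an assumption: 330 tests where the null is true and 20 where there is a real effect) to compare no correction, Bonferroni, Holm, and the Benjamini-Hochberg procedure. The first three aim to control the chance of any false positive, whereas Benjamini-Hochberg controls the expected proportion of false positives among the flagged tests, so it answers a somewhat different question.

```r
set.seed(2021)
# Toy statistics (assumed): 330 tests under the null and 20 with a real effect
z_sim <- c(rnorm(330), rnorm(20, mean = 4))
p_sim <- 2 * pnorm(abs(z_sim), lower.tail = FALSE)

# Number of tests flagged at 0.05 under different adjustments
methods <- c(none = "none", bonferroni = "bonferroni", holm = "holm", BH = "BH")
flagged <- sapply(methods,
                  function(method) sum(p.adjust(p_sim, method = method) <= 0.05))
flagged
```

Anything flagged by Bonferroni is also flagged by the other procedures, which is the sense in which it is the most conservative rule.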

25.9 Detection of Hot and Cold Spots

As the examples above illustrate, local statistics can be very useful in detecting what might be termed “hot” and “cold” spots. A hot spot is a group of observations that are significantly high, whereas a cold spot is a group of observations that are significantly low.

There are many different applications where hot/cold spot detection is important.

For instance, in many studies of urban form, it is important to identify centers and subcenters - by population, by property values, by incidence of trips, and so on. In spatial criminology, detecting hot spots of crime can help with prevention and law enforcement efforts. In environmental studies, remediation efforts can be greatly assisted by identification of hot areas. In spatial epidemiology, hot spots can indicate locations where a large number of cases of a disease have been observed. There are countless applications of this.

25.10 Other Resources

Check a cool app that illustrates the \(G_i^*\) statistic here

26 Activity 12: Area Data IV

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

26.1 Practice questions

Answer the following questions:

  1. How are row-standardized and binary spatial weights interpreted?
  2. What is the reason for using a Bonferroni correction for multiple tests?
  3. What types of spatial patterns can the local version of Moran’s I detect?
  4. What types of spatial patterns can the \(G_i(d)\) statistic detect?
  5. What is the utility of detecting hot and cold spatial spots?

26.2 Learning objectives

In this activity, you will:

  1. Calculate Moran’s I coefficient of autocorrelation for area data.
  2. Create Moran’s scatterplots.
  3. Examine the results of the tests/scatterplots for further insights.
  4. Think about ways to decide whether a landscape is random when working with area data.

26.3 Suggested reading

O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 7. John Wiley & Sons: New Jersey.

26.4 Preliminaries

It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently on the workspace.

Load the libraries you will use in this activity.

In addition to tidyverse, you will need sf, a package that implements simple features in R (you can learn about sf here) and spdep, a package that implements several spatial statistical methods (you can learn more about it here):

library(isdas)
library(sf)
library(spdep)
library(tidyverse)

Begin by loading the data that you will use in this activity:

data(Hamilton_CT)

This is an sf object with census tracts and selected demographic variables for the Hamilton CMA in Canada. You can obtain new (calculated) variables as follows. For instance, to obtain the proportion of residents who are between 20 and 34 years old, and between 35 and 49:

Hamilton_CT <- mutate(Hamilton_CT, 
                      Prop20to34 = (AGE_20_TO_24 + AGE_25_TO_29 + AGE_30_TO_34)/POPULATION, 
                      Prop35to49 = (AGE_35_TO_39 + AGE_40_TO_44 + AGE_45_TO_49)/POPULATION)

This function is used to create local Moran maps:

localmoran.map <- function(p = p, listw = listw, VAR = VAR, by = by){
  require(tidyverse)
  require(spdep)
  require(plotly)
  
  df_msc <- transmute(p,
                      key = p[[by]],
                      Z = (p[[VAR]] - mean(p[[VAR]])) / var(p[[VAR]]),
                      SMA = lag.listw(listw, Z),
                      Type = case_when(Z < 0 & SMA < 0 ~ "LL",
                                       Z > 0 & SMA > 0 ~ "HH",
                                       TRUE ~ "HL/LH"))
  
  local_I <- localmoran(p[[VAR]], listw)
  
  df_msc <- left_join(df_msc, 
                  data.frame(key = p[[by]], local_I))
  df_msc <- rename(df_msc, p.val = Pr.z...0.)
  
  plot_ly(df_msc) %>%
    add_sf(split = ~(p.val < 0.05), color = ~Type, colors = c("red", "khaki1", "dodgerblue", "dodgerblue4")) 
}

This function is used to create \(G_i^*\) maps:

gistar.map <- function(p = p, listw = listw, VAR = VAR, by = by){
  require(tidyverse)
  require(spdep)
  require(sf)
  require(plotly)
  
  p <- mutate(p, key = p[[by]])
  
  df.lg <- localG(p[[VAR]], listw)
  df.lg <- as.numeric(df.lg)
  df.lg <- data.frame(Gstar = df.lg, p.val = 2 * pnorm(abs(df.lg), lower.tail = FALSE))
  
  df.lg <- mutate(df.lg, 
              Type = case_when(Gstar < 0 & p.val <= 0.05 ~ "Low Concentration",
                               Gstar > 0 & p.val <= 0.05 ~ "High Concentration",
                               TRUE ~ "Not Significant"))

  p <- left_join(p, 
                  data.frame(key = p[[by]], df.lg))
  
  plot_ly(p) %>%
    add_sf(split = ~(p.val < 0.05), color = ~Type, colors = c("red", "dodgerblue", "gray"))
}

Create spatial weights.

  1. By contiguity:
Hamilton_CT.w <- nb2listw(poly2nb(pl = Hamilton_CT))
  2. Binary, by distance (3 km threshold) including self:
Hamilton_CT.3knb <- Hamilton_CT %>% 
  st_centroid() %>%
  dnearneigh(d1 = 0, d2 = 3)
## Warning in st_centroid.sf(.): st_centroid assumes attributes are constant over geometries of x
Hamilton_CT.3kw <- nb2listw(include.self(Hamilton_CT.3knb), style = "B")

You are now ready for the next activity.

26.5 Activity

NOTE: Activities include technical “how to” tasks/questions. Usually, these ask you to practice using the software to organize data, create plots, and so on in support of analysis and interpretation. The second type of questions ask you to activate your brainware and to think geographically and statistically.

Activity Part I

  1. Create local Moran maps for the population in age group 20-34 and proportion of population in age group 20-34.

  2. Use the \(G_i^*\) statistic to analyze the population and proportion of population in the age group 20-34.

  3. Now create local Moran maps for the population and population density in the age group 20-34.

Activity Part II

  1. Concerning the analysis in point 1: What is the difference between using population (absolute) and population density (rate)?

  2. Concerning the analysis in point 2: What is the difference between using population (absolute) and proportion of population (rate)? Is there a reason to prefer either variable in analysis? Discuss.

  3. More generally, what do you think should guide the decision of whether to analyze variables as absolute values or rates?

27 Area Data V

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

27.1 Learning Objectives

In the previous chapter, you learned how to decompose Moran’s \(I\) coefficient into local versions of an autocorrelation statistic. You also learned about a concentration statistic, and saw how these local spatial statistics can be used for exploratory spatial data analysis, for example to search for “hot” and “cold” spots. In this practice, you will:

  1. Practice how to estimate regression models in R.
  2. Learn about autocorrelation as a model diagnostic.
  3. Learn about variable transformations.
  4. Use autocorrelation analysis to improve regression models.

27.2 Suggested Readings

  • Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapter 7. Longman: Essex.
  • Bivand RS, Pebesma E, and Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 9. Springer: New York.
  • Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 7. Sage: Los Angeles.
  • O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 7. John Wiley & Sons: New Jersey.

27.3 Preliminaries

As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently on the workspace.

Load the libraries you will use in this activity:

library(isdas)
library(plotly)
library(sf)
library(tidyverse)
library(units)
library(spdep)

Next, read an object of class sf (simple feature) with the census tracts of Hamilton CMA and some selected population variables from the 2011 Census of Canada. This dataset will be used for examples in this chapter:

data(Hamilton_CT)

27.4 Regression Analysis in R

Regression analysis is one of the most powerful techniques in the repertoire of data analysis. There are many different forms of regression, and they usually take the following form: \[ y_i = f(x_{ij}) + \epsilon_i \]

This is a model for a stochastic process. The outcome is \(y_i\), which could be the observed values of a variable \(y\) at locations \(i\). We will think of these locations as areas, but they could as well be points, nodes on a network, links on a network, etc. The model consists of two components: a systematic/deterministic part, that is \(f(x_{ij})\), which is a function of a collection of variables \(x_{i1}, x_{i2}, \cdots, x_{ij}, \cdots, x_{ik}\); and a random part, captured by the term \(\epsilon_i\).

In this chapter we will deal with one specific form of regression, namely linear regression. A linear regression model posits (as the name implies) linear relationships between an outcome, called a dependent variable, and one or more covariates, called independent variables. It is important to note that regression models capture statistical relationships, not causal relationships. Even so, causality is often implied by the choice of independent variables. In a way, regression analysis is a tool to infer process from pattern: it is a formula that aims to retrieve the elements of the process based on our observations of the outcome.

This is the form of a linear regression model: \[ y_i = f(x_{ij}) + \epsilon_i = \beta_0 + \sum_{j=1}^k{\beta_jx_{ij}} + \epsilon_i \] where \(y_i\) is the dependent variable and \(x_{ij}\) (\(j=1,...,k\)) are the independent variables. The coefficients \(\beta\) are not known, but can be estimated from the data. And \(\epsilon_i\) is the random term, which in regression analysis is often called a residual (or error), because it is the difference between the systematic term of the model and the value of \(y_i\): \[ \epsilon_i = y_i - \bigg(\beta_0 + \sum_{j=1}^k{\beta_jx_{ij}}\bigg) \]

Estimation of a linear regression model is the procedure used to obtain values for the coefficients. This typically involves defining a loss function that needs to be minimized. In the case of linear regression, a widely used estimation procedure is least squares. This procedure allows a modeler to find the coefficients that minimize the sum of squared residuals, which becomes the loss function for the procedure. In very simple terms, the protocol is as follows: \[ \text{Find the values of }\beta\text{ that minimize }\sum_{i=1}^n{\epsilon_i^2} \]
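
For linear regression this minimization has a closed-form solution. Writing the model in matrix form, with \(X\) the matrix of covariates (including a column of ones for the intercept), the least squares coefficients solve the so-called normal equations \(X'X\hat{\beta} = X'y\). The following sketch, which uses simulated data (an assumption for illustration), solves these equations directly and compares the result with lm():

```r
set.seed(1)
# Simulated data (assumed): y depends linearly on x, plus random noise
n <- 50
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n)

# Matrix of covariates with a column of ones for the intercept
X <- cbind(1, x)
# Coefficients that minimize the sum of squared residuals
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# The same estimates, obtained with lm()
cbind(by_hand = as.vector(beta_hat), lm = coef(lm(y ~ x)))
```

In practice there is no need to do this by hand; the comparison is only meant to show that there is nothing mysterious behind the estimates that the software reports.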

For this procedure to be valid, there are a few assumptions that need to be satisfied, including:

  1. The functional form of the model is correct.

  2. The independent variables are not collinear; this is often diagnosed by calculating the correlations among the independent variables, with values greater than 0.8 often being problematic.

  3. The residuals have a mean of zero: \[ E[\epsilon_i|X]=0 \]

  4. The residuals have constant variance: \[ Var[\epsilon_i|X] = \sigma^2 \text{ }\forall i \]

  5. The residuals are independent, that is, they are not correlated among them: \[ E[\epsilon_i\epsilon_j|X] = 0 \text{ }\forall i\neq j \]

The last three assumptions ensure that the residuals are random. Violation of these assumptions is often a consequence of a failure in the first two (i.e., the model was not properly specified, and/or the residuals are not exogenous).
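
Some of these assumptions are easy to check informally. The sketch below uses simulated covariates (an assumption for illustration; one of them is built to be nearly a copy of another) to show a correlation matrix of the kind used to diagnose collinearity, and to verify that the residuals of a model with an intercept average zero:

```r
set.seed(10)
# Simulated covariates (assumed): x2 is built to be strongly related to x1
n <- 100
x1 <- rnorm(n)
x2 <- 0.95 * x1 + 0.1 * rnorm(n)
x3 <- rnorm(n)
y <- 1 + 2 * x1 - x3 + rnorm(n)

# Correlations among the independent variables; look for large absolute values
round(cor(cbind(x1, x2, x3)), 2)

# Residuals of a least squares model with an intercept average zero
mod <- lm(y ~ x1 + x3)
mean(residuals(mod))
```

Note that the average of the residuals is zero by construction when the model includes an intercept, so this check says nothing about whether the residuals are independent. That is the assumption that we will examine using spatial autocorrelation analysis.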

When all these assumptions are met, the coefficients are said to be BLUE: Best Linear Unbiased Estimates - a desirable property because we wish to be able to quantify the relationships between covariates without bias.

This section provides a refresher on linear regression, before reviewing the estimation of regression models in R. The basic command for multivariate linear regression in R is lm(), for “linear model”. This is the help file of this function:

# Remember that we can search the definition of a function by using a question mark in front of the function itself. 
?lm 

We will see now how to estimate a model using this function. The example we will use is of urban population density gradients. Population density gradients are representations of the variation of population density in cities. These gradients are of interest because they are related to land rent, urban form, and commuting patterns, among other things (see accompanying reading for more information).

Urban economic theory suggests that population density declines with distance from the central business district of a city, or its CBD. This leads to the following model, where the population density at location \(i\) is a function of the distance of \(i\) to the CBD. Since this is likely a stochastic process, we allow for some randomness by means of the residuals: \[ P_i = f(D_i) + \epsilon_i \]

To implement this model, we need to add distance to the CBD as a covariate in our dataframe. We will use Jackson Square, a central shopping mall in Hamilton, as the CBD of the city:

# Create a small data frame with the coordinates of Jackson Square; these coordinates,
# which are in lat-long are converted into a simple features table, with coordinate 
# reference system epsg:4326 (for lat-long); finally, we transform the coordinates to 
# the same coordinate reference system of our Hamilton census tracts, which we retrieve
# with the function `st_crs()`
xy_cbd <- data.frame(x = -79.8708,
                     y = 43.2584) %>%
  st_as_sf(coords = c("x", "y"),
           crs = 4326) %>%
  st_transform(st_crs(Hamilton_CT))

To calculate the distance from the census tracts to the CBD, we retrieve the centroids of the census tracts:

# We retrieve the centroids of the census tracts in Hamilton_CT by using `st_centroid()`
xy_ct <- st_centroid(Hamilton_CT)
## Warning in st_centroid.sf(Hamilton_CT): st_centroid assumes attributes are constant over geometries of x

Given these coordinates, the function st_distance() from the sf package can be used to calculate the distance between the centroids of the census tracts and Hamilton’s CBD. Call this dist.sl, i.e., the straight-line distance to the CBD:

# Function `st_distance()` is used to calculate the distance between two sets
# of points. Here, we use it to calculate the distance from the centroids of 
# the census tracts to the coordinates of the CBD. We will call this variable
# `dist.sl`, for "straight line" to remind us what kind of distance this is. 
dist.sl <- st_distance(xy_ct,
                       xy_cbd)

Next, we add our new variable, distance to the CBD, to our dataframe Hamilton_CT for analysis:

Hamilton_CT$dist.sl <- dist.sl

Regression analysis is implemented in R by means of the lm function. The arguments of the model include an object of type “formula” and a dataframe. Other arguments include conditions for subsetting the data, using sampling weights, and so on.

A formula is written in the form y ~ x1 + x2, and more complex expressions are possible too, as we will see below. For the time being, the formula is simply POP_DENSITY ~ dist.sl:

# The function `lm()` implements regression analysis in `R`. Recall that 'dist.sl' is the distance from the CBD (Jackson Square)
model1 <- lm(formula = POP_DENSITY ~ dist.sl, data = Hamilton_CT)
summary(model1) 
## 
## Call:
## lm(formula = POP_DENSITY ~ dist.sl, data = Hamilton_CT)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3841.2 -1338.3  -177.1   950.8 10009.1 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4405.13325  250.16396  17.609  < 2e-16 ***
## dist.sl       -0.17994    0.02419  -7.439 3.63e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1892 on 186 degrees of freedom
## Multiple R-squared:  0.2293, Adjusted R-squared:  0.2251 
## F-statistic: 55.33 on 1 and 186 DF,  p-value: 3.633e-12

The value of the function is an object of class lm that contains the results of the estimation, including the coefficients with their diagnostics, and the coefficient of multiple determination, among other items.

Notice how the coefficient for distance is negative (and significant). This indicates that population density declines with increasing distance: \[ P_i = f(D_i) + \epsilon_i = 4405.13325 - 0.17994D_i + \epsilon_i \]
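
To get a feeling for what the coefficients mean, we can evaluate the systematic part of the model at a few distances. This sketch copies the estimates from the summary above; it assumes that distance is measured in meters, which is what the magnitude of the coefficient suggests but is not stated in the output.

```r
# Coefficients as reported in the summary of the model above
b0 <- 4405.13325
b1 <- -0.17994

# Distances to the CBD, assumed here to be in meters (0, 5, and 10 km)
d <- c(0, 5000, 10000)

# Predicted population density at each distance
pred <- b0 + b1 * d
pred

# Distance at which the fitted line reaches a density of zero
-b0 / b1
```

Under that assumption, the model predicts that density drops by about 180 units for every additional kilometer from the CBD. Beyond the distance where the line crosses zero, the model predicts negative densities, which is an early hint that a straight line may not be the best functional form.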

27.5 Autocorrelation as a Model Diagnostic

We can quickly explore the fit of the model. Since our model contains only one independent variable, we can use a scatterplot to see how it relates to population density. The points in the scatterplot are the actual population density and the distance to CBD. We also use the function geom_abline() to add the regression line to the plot, in blue:

ggplot(data = Hamilton_CT, aes(x = dist.sl, 
                               y = POP_DENSITY)) + 
  geom_point() +
  geom_abline(slope = model1$coefficients[2], # Recall that `geom_abline()` draws a line with intercept and slope as defined. Here the line is drawn using the coefficients of the regression model we estimated above. 
              intercept = model1$coefficients[1], 
              color = "blue", size = 1) +
  geom_vline(xintercept = 0) + # We also add the y axis... 
  geom_hline(yintercept = 0) # ...and the x axis.

Clearly, there remains a fair amount of noise after this model (the scatter of the dots around the regression line). In this case, the regression line captures the general trend of the data, but seems to underestimate most of the high population density areas closer to the CBD, and it also overestimates many of the low population areas.

If the pattern of under- and over-estimation is random (i.e., the residuals are random), that would indicate that the model successfully retrieved all the systematic pattern. If the pattern is not random, there is a violation of the assumption of independence. To explore this issue, we will add the residuals of the model to the dataframe:

# Here we add the residuals from 'model1' to the dataframe, with the name `model1.e` 
Hamilton_CT$model1.e <- model1$residuals

Since we are interested in statistical maps, we will create a map of the residuals. In this map, we will use red to indicate negative residuals (values of the dependent variable that the model overestimates), and blue for positive residuals (values of the dependent variable that the model underestimates):

# Recall that 'plot_ly()' is a function used to create interactive plots
plot_ly(Hamilton_CT) %>% 
  # Recall that `add_sf()` is similar to `geom_sf()` and it draws a simple features object on a `plotly` plot. This example adds colors to represent positive (blue) and negative residuals (red).
  add_sf(type = "scatter",
         color = ~(model1.e > 0), 
         colors = c("red", 
                    "dodgerblue4")) 

In the legend of the plot, “TRUE” means that the residual is positive, and “FALSE” that it is negative. Does the spatial distribution of residuals look random?

In this case, visual inspection is very suggestive. In addition, we have the tools to help us with this question, in particular how to make a decision while quantifying our levels of confidence: the \(p\)-values of Moran’s \(I\) coefficient, for instance. We will create a set of spatial weights:

# Here, we use `poly2nb()` to create a list of neighbors, based on the criterion of adjacency. Next, we pass that list of neighbors to `nb2listw()` to create a set of spatial weights.  
Hamilton_CT.w <- Hamilton_CT %>% 
  poly2nb() %>%
  nb2listw() 

Once we have a set of spatial weights, we can calculate Moran’s \(I\):

moran.test(Hamilton_CT$model1.e, 
           Hamilton_CT.w)
## 
##  Moran I test under randomisation
## 
## data:  Hamilton_CT$model1.e  
## weights: Hamilton_CT.w    
## 
## Moran I statistic standard deviate = 9.7448, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.395387645      -0.005347594       0.001691093

The results of the Moran’s \(I\) test support our visual inspection of the map. Notice how we can reject the null hypothesis (spatial randomness) at a very high level of confidence (see the extremely small value of \(p\)).
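
The numbers in the output are related in a simple way, which can be verified with a few lines of arithmetic. The values below are copied from the output above; the number of census tracts, 188, is inferred from the 186 residual degrees of freedom of the model plus its two estimated coefficients.

```r
# Values reported by moran.test() above
I_obs <- 0.395387645
I_exp <- -0.005347594
I_var <- 0.001691093

# Under the null hypothesis, the expected value of I is -1/(n - 1)
-1 / (188 - 1)

# Standard deviate: distance from the expectation, in standard deviations
z_I <- (I_obs - I_exp) / sqrt(I_var)
z_I

# One-sided p-value (the alternative hypothesis is "greater")
pnorm(z_I, lower.tail = FALSE)
```

The observed coefficient is almost ten standard deviations above what we would expect from a random map pattern, which is why the \(p\)-value is so small.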

Spatial autocorrelation, as mentioned above, is a violation of a key assumption of linear regression, and likely the consequence of a model that was not correctly specified, either because the functional form was incorrect (e.g., the relationship was not linear), or there are missing covariates.

We will explore the first of these possibilities by means of variable transformations.

27.6 Variable Transformations

The term linear regression refers to the linearity in the coefficients. Variable transformations allow you to consider non-linear relationships between the covariates and the dependent variable, while still preserving the linearity in the coefficients.

For instance, a possible transformation of the variable distance could be its inverse: \[ f(D_i) = \beta_0 + \beta_1\frac{1}{D_i} \]
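To see why such a model is still linear in the coefficients, consider the following minimal sketch in base R. The distances and coefficient values are made up for illustration (they are not taken from the Hamilton data), and the response is generated without noise so that the fit is exact.

```r
# Hypothetical distances to the CBD (made-up values for illustration)
d <- c(500, 1000, 2000, 4000, 8000)

# Hypothetical "true" coefficients of the inverse-distance model
beta0 <- 2000
beta1 <- 3e6

# Noise-free response: non-linear in d, but linear in beta0 and beta1
y <- beta0 + beta1 * (1 / d)

# Because 1/d is just another covariate, ordinary least squares applies;
# I() tells the formula to evaluate 1/d arithmetically
fit <- lm(y ~ I(1 / d))
coef(fit)
```

The estimated coefficients coincide with `beta0` and `beta1`, even though the relationship between `y` and `d` is a curve.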

Here, we will create a new covariate that is the inverse distance:

# Recall that the function `mutate()` adds new variables to an existing dataframe, while preserving those that already exist. Here, we use our variable with the distance to the CBD to create a new variable that is the inverse distance.
Hamilton_CT <- mutate(Hamilton_CT, 
                      invdist.sl = 1/dist.sl)

Once we have the inverse distance, we can estimate a second model using it as the covariate:

# Notice how the new 'model2' uses the inverse distance from the CBD rather than the original distance.
model2 <- lm(formula = POP_DENSITY ~ invdist.sl, 
             data = Hamilton_CT)
summary(model2) 
## 
## Call:
## lm(formula = POP_DENSITY ~ invdist.sl, data = Hamilton_CT)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6763  -1375    -52   1108   9675 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2299.6      164.4  13.988  < 2e-16 ***
## invdist.sl  2259818.2   341936.7   6.609 3.97e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1940 on 186 degrees of freedom
## Multiple R-squared:  0.1902, Adjusted R-squared:  0.1858 
## F-statistic: 43.68 on 1 and 186 DF,  p-value: 3.967e-10

As the scatterplot below shows (as before, the blue line is the regression line), we can capture a non-linear relationship. This model does a somewhat better job of describing the high density of tracts close to the CBD. Unfortunately, it is a poor description of density almost everywhere else:

ggplot(data = Hamilton_CT, 
       aes(x = dist.sl, 
           y = POP_DENSITY)) + 
  geom_point() +
  stat_function(fun=function(x)model2$coefficients[1] + model2$coefficients[2]/x, 
                geom="line", 
                color = "blue",
                size = 1) +
  geom_vline(xintercept = 0) + 
  geom_hline(yintercept = 0)

We will add the residuals of this model to the dataframe for further examination, in particular testing for spatial autocorrelation:

Hamilton_CT$model2.e <- model2$residuals

If we calculate Moran’s \(I\), we notice that the coefficient is lower than for the previous model but the \(p\)-value is still very low, which means that we can confidently reject the hypothesis that the residuals are random. But we would actually prefer to not reject this hypothesis, since we would like the residuals to be random!

moran.test(Hamilton_CT$model2.e, 
           Hamilton_CT.w)
## 
##  Moran I test under randomisation
## 
## data:  Hamilton_CT$model2.e  
## weights: Hamilton_CT.w    
## 
## Moran I statistic standard deviate = 8.8236, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.358134126      -0.005347594       0.001696970

The results of the test suggest that the model still fails at capturing the systematic aspects of population density gradients, so we need to investigate this further.

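The literature on population density gradients suggests other non-linear transformations, including: \[ P_i = f(x_i) = exp(\beta_0)exp(\beta_1x_i) \] where \(P_i\) is the population density of census tract \(i\) and \(x_i\) is its distance from the CBD.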

This function is no longer linear in the coefficients (since the coefficients \(\beta_0\) and \(\beta_1\) are transformed by the exponential). Fortunately, there is a simple way of changing this to a linear expression, by taking the logarithm on both sides of the equation: \[ ln(P_i) = \beta_0 + \beta_1x_i \]

By transforming the dependent variable we obtain a function that is linear in the parameters. To implement this model, we need to create a new variable that is the logarithm of population density:

# Here we use `mutate()` to create a new variable that is the natural logarithm of population density. With this transformed dependent variable, the model is linear in the coefficients.
Hamilton_CT <- Hamilton_CT %>%
  mutate(lnPOP_DEN = log(POP_DENSITY)) 

This allows us to estimate a third model, as follows:

model3 <- lm(formula = lnPOP_DEN ~ dist.sl, 
             data = Hamilton_CT)
summary(model3)
## 
## Call:
## lm(formula = lnPOP_DEN ~ dist.sl, data = Hamilton_CT)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5857 -0.3395  0.2970  0.6897  2.4224 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.465e+00  1.588e-01  53.293  < 2e-16 ***
## dist.sl     -1.161e-04  1.536e-05  -7.561 1.77e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.202 on 186 degrees of freedom
## Multiple R-squared:  0.2351, Adjusted R-squared:  0.231 
## F-statistic: 57.17 on 1 and 186 DF,  p-value: 1.773e-12
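Because the dependent variable is in logarithms, the coefficients are easier to interpret after reverting them to the exponential form. The following sketch uses the estimates printed above. Note that the units of `dist.sl` are not stated in this section, so the reference distance of 1,000 units is only an illustration.

```r
# Estimates copied from the summary of model3 above
b0 <- 8.465       # intercept, in log-density units
b1 <- -1.161e-04  # change in log-density per unit of dist.sl

# Predicted density at distance zero, i.e., at the CBD
density_at_cbd <- exp(b0)

# Multiplicative change in predicted density for every additional
# 1,000 units of distance
ratio_per_1000 <- exp(b1 * 1000)

# Equivalent percentage decline
pct_decline_per_1000 <- 100 * (1 - ratio_per_1000)

print(round(density_at_cbd))
print(round(pct_decline_per_1000, 1))
```

Under this model, predicted density falls by a constant proportion (roughly 11%) for every 1,000 units of distance, rather than by a constant amount as in the first model.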

We can recreate the scatterplot and add the regression line. Notice that to create the line, we revert the coefficients to the exponential form of the model:

ggplot(data = Hamilton_CT, 
       aes(x = dist.sl, 
           y = POP_DENSITY)) + 
  geom_point() +
  stat_function(fun=function(x)exp(model3$coefficients[1] + model3$coefficients[2] * x), 
                geom="line", 
                color = "blue", 
                size = 1) +
  geom_vline(xintercept = 0) + 
  geom_hline(yintercept = 0)

As before, we can add the residuals of the model to the dataframe for further examination:

Hamilton_CT$model3.e <- model3$residuals

While this latest model provides a somewhat better fit, there is still systematic under- and over-prediction, as seen in the map below (red are negative residuals and blue are positive):

plot_ly(Hamilton_CT) %>%
  add_sf(type = "scatter",
         color = ~(model3.e > 0), 
         colors = c("red", 
                    "dodgerblue4"))

Moran’s \(I\) also strongly suggests that the residuals are still not random/independent:

moran.test(Hamilton_CT$model3.e, 
           Hamilton_CT.w)
## 
##  Moran I test under randomisation
## 
## data:  Hamilton_CT$model3.e  
## weights: Hamilton_CT.w    
## 
## Moran I statistic standard deviate = 8.0158, p-value = 5.47e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.325935548      -0.005347594       0.001708056

27.7 A Note about Spatial Autocorrelation in Regression Analysis

Spatial autocorrelation was originally seen as a problem in regression analysis. It is not difficult to see why, after testing three models in this chapter.

My preference is to view spatial autocorrelation as an opportunity for discovery. For instance, the models above all seem to struggle to capture the large variations in population density between the central parts of the city and the suburbs of Hamilton. Perhaps this could be due to a regime change, or in other words, the presence of an underlying process that operates somewhat differently in different parts of the city. The latest model we estimated (model3), for instance, suggests that the close proximity of Burlington might have an effect.

The analysis that follows is somewhat more advanced, but serves to illustrate the idea of spatial autocorrelation as a tool for discovery.

We will begin by creating local Moran maps to identify potential “hot” and “cold” spots of population density. We can envision these as representing different spatial regimes:

localmoran.map(Hamilton_CT, 
               Hamilton_CT.w, 
               "POP_DENSITY", 
               by = "TRACT")

Examination of the map above suggests that there are possibly three regimes: a CBD (“HH” and significant tracts), Suburbs (“LL” and significant tracts), and Other (not significant tracts). Based on this, we will create two indicator variables, one for census tracts in the CBD and another for census tracts in the Suburbs. An indicator variable takes values of 1 or zero, depending on whether a condition is true. For instance, all census tracts in the CBD will take the value of 1 in the CBD indicator variable, and all others will be zero.

Begin by computing the local statistics:

POP_DEN.lm <- localmoran(Hamilton_CT$POP_DENSITY, 
                         listw = Hamilton_CT.w)

colnames(POP_DEN.lm) <- c("Ii", "E.Ii", "Var.Ii", "Z.Ii", "p.val")

Next, we will identify the type of tract based on the spatial relationships according to the local statistics (i.e., “HH”, “LL”, or “HL/LH”).

df_msc <- Hamilton_CT %>%
  transmute(TRACT = TRACT,
            # Standardize the variable: center on the mean and divide by the standard deviation
            Z = (POP_DENSITY - mean(POP_DENSITY)) / sd(POP_DENSITY),
            SMA = lag.listw(Hamilton_CT.w, Z),
            Type = case_when(Z < 0 & SMA < 0 ~ "LL",
                             Z > 0 & SMA > 0 ~ "HH",
                             TRUE ~ "HL/LH"))

After that, identify as CBD all tracts for which Type is “HH” and the \(p\)-value is less than 0.05. Likewise, identify as Suburb all tracts for which Type is “LL” and the \(p\)-value is also less than 0.05:

df_msc <- cbind(df_msc, 
                POP_DEN.lm)

CBD <- ifelse(df_msc$Type == "HH" & df_msc$p.val < 0.05, 
              1, 
              0)
Suburb <- ifelse(df_msc$Type == "LL" & df_msc$p.val < 0.05, 
                 1, 
                 0)

We then add the indicator variables to the dataframe:

Hamilton_CT$CBD <- CBD
Hamilton_CT$Suburb <- Suburb
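The classification into “HH”, “LL”, and “HL/LH” can be sketched on a toy example in base R. The five zones, their values, and the weights matrix below are made up for illustration. Only the signs of the standardized variable and of its spatial lag matter for the classification, so the scaling constant does not affect the result. In the analysis above, the \(p\)-values of the local statistics are additionally used to keep only the significant tracts.

```r
# Toy example: five zones arranged in a chain (1-2-3-4-5), made-up values
x <- c(12, 8, 2, 1, 7)

# Row-standardized spatial weights: the weights of each zone's neighbors sum to 1
W <- matrix(c(0,   1,   0,   0,   0,
              0.5, 0,   0.5, 0,   0,
              0,   0.5, 0,   0.5, 0,
              0,   0,   0.5, 0,   0.5,
              0,   0,   0,   1,   0),
            nrow = 5, byrow = TRUE)

# Standardize the variable (mean-center and divide by the standard deviation)
z <- (x - mean(x)) / sd(x)

# Spatial moving average: the weighted mean of the neighbors' z values
sma <- as.vector(W %*% z)

# Classify each zone by the signs of its own value and of its spatial lag
type <- ifelse(z > 0 & sma > 0, "HH",
               ifelse(z < 0 & sma < 0, "LL", "HL/LH"))

data.frame(zone = 1:5, x = x, z = round(z, 2), sma = round(sma, 2), type = type)
```

Zones 1 and 2 are high values surrounded by high values, zones 3 and 4 are low values surrounded by low values, and zone 5 is a high value next to a low one.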

The model that I propose to estimate is a variation of the last non-linear specification, but with regime breaks: \[ ln(P_i) = \beta_0 + \beta_1x_i + \beta_2CBD_i + \beta_3Suburb_i + \beta_4CBD_ix_i + \beta_5Suburb_ix_i + \epsilon_i \]

Since the indicator variables for CBD and Suburb take values of zero and one, effectively we have the following: \[ ln(P_i)=\Bigg\{\begin{array}{l l} (\beta_0 + \beta_2) + (\beta_1 + \beta_4)x_i + \epsilon_i \text{ if census tract } i \text{ is in the CBD}\\ (\beta_0 + \beta_3) + (\beta_1 + \beta_5)x_i + \epsilon_i \text{ if census tract } i \text{ is in the Suburbs}\\ \beta_0 + \beta_1x_i + \epsilon_i \text{ otherwise}\\ \end{array} \]

Notice that the model now allows for different slopes and intercepts for observations in different parts of the city. Estimate the model:

model4 <- lm(formula = lnPOP_DEN ~ CBD + Suburb + dist.sl + CBD:dist.sl + Suburb:dist.sl,
             data = Hamilton_CT)
summary(model4)
## 
## Call:
## lm(formula = lnPOP_DEN ~ CBD + Suburb + dist.sl + CBD:dist.sl + 
##     Suburb:dist.sl, data = Hamilton_CT)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2749 -0.2739  0.2398  0.5639  2.1412 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     7.985e+00  1.730e-01  46.160  < 2e-16 ***
## CBD             9.573e-01  4.779e-01   2.003  0.04664 *  
## Suburb         -4.277e-01  6.140e-01  -0.697  0.48695    
## dist.sl        -4.569e-05  1.723e-05  -2.651  0.00872 ** 
## CBD:dist.sl    -5.005e-05  2.250e-04  -0.222  0.82423    
## Suburb:dist.sl -1.047e-04  4.149e-05  -2.523  0.01250 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.049 on 182 degrees of freedom
## Multiple R-squared:  0.429,  Adjusted R-squared:  0.4133 
## F-statistic: 27.35 on 5 and 182 DF,  p-value: < 2.2e-16

This model provides a much better fit than the preceding models (see the coefficient of multiple determination).
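To make the regime structure concrete, the intercept and slope implied for each regime can be computed from the estimates printed above. This is a minimal sketch in base R; the vector names are illustrative.

```r
# Estimates copied from the summary of model4 above
b <- c(intercept   = 7.985,
       CBD         = 9.573e-01,
       Suburb      = -4.277e-01,
       dist        = -4.569e-05,
       CBD_dist    = -5.005e-05,
       Suburb_dist = -1.047e-04)

# Intercept and slope implied for each regime
regimes <- data.frame(
  regime = c("CBD", "Suburb", "Other"),
  intercept = c(b[["intercept"]] + b[["CBD"]],
                b[["intercept"]] + b[["Suburb"]],
                b[["intercept"]]),
  slope = c(b[["dist"]] + b[["CBD_dist"]],
            b[["dist"]] + b[["Suburb_dist"]],
            b[["dist"]])
)
regimes
```

The implied gradient is steepest in the Suburb regime. Keep in mind that the interaction of CBD and distance is not statistically significant in the output above, so the difference in slope between the CBD and Other regimes cannot be distinguished from zero.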

We can visually examine the spatial distribution of the residuals by means of the following map:

Hamilton_CT$model4.e <- model4$residuals
plot_ly(Hamilton_CT) %>%
  add_sf(type = "scatter",
         color = ~(model4.e > 0), 
         colors = c("red", 
                    "dodgerblue4"))

It is not clear from the visual inspection that the residuals are independent, but this can be tested as usual by means of Moran’s \(I\) coefficient:

moran.test(Hamilton_CT$model4.e,
           Hamilton_CT.w)
## 
##  Moran I test under randomisation
## 
## data:  Hamilton_CT$model4.e  
## weights: Hamilton_CT.w    
## 
## Moran I statistic standard deviate = 2.267, p-value = 0.0117
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.086767905      -0.005347594       0.001651087

Based on the results, we can still reject the null hypothesis of spatial randomness at the conventional 5% level of significance (the \(p\)-value is 0.0117), although the evidence against it is much weaker than for the previous models. We also see that the model has been able to absorb more of the residual autocorrelation than the preceding alternatives (Moran’s \(I\) drops from 0.326 for model3 to 0.087), and that it provides a better statistical fit to the variable population density (with a higher \(R^2\)).

The following figure illustrates this last model:

# We will create three functions to represent each of the three regimes in `model4`
fun.1 <- function(x)exp(model4$coefficients[1] + model4$coefficients[2] + (model4$coefficients[4] + model4$coefficients[5]) * x) #CBD
fun.2 <- function(x)exp(model4$coefficients[1] + model4$coefficients[3] + (model4$coefficients[4] + model4$coefficients[6]) * x) #Suburb
fun.3 <- function(x)exp(model4$coefficients[1] + model4$coefficients[4] * x) #Other

ggplot(data = Hamilton_CT, aes(x = dist.sl, y = POP_DENSITY)) +
  geom_point() +
  geom_point(data = filter(Hamilton_CT, CBD == 1), color = "Red") +
  geom_point(data = filter(Hamilton_CT, Suburb == 1), color = "Blue") +
  # `stat_function()` draws custom functions on a `ggplot2` plot.
  stat_function(fun= fun.1, 
                geom="line", size = 1, aes(color = "CBD")) +
  stat_function(fun=fun.2, 
                geom="line", size = 1, aes(color = "Suburb")) +
  stat_function(fun=fun.3, 
                geom="line", size = 1, aes(color = "Other")) +
  # Set the colors of the regression lines
  scale_color_manual(values = c("CBD" = "red", "Other" = "black", "Suburb" = "blue")) +
  geom_vline(xintercept = 0) + # Add the y axis...
  geom_hline(yintercept = 0) # ...and the x axis.

This example illustrates how spatial exploratory analysis can provide valuable insights to improve our models, and in turn hopefully develop a better understanding of the underlying process. What can you say about population density in Hamilton based on this model?

28 Activity 13: Area Data V

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

28.1 Practice questions

Answer the following questions:

  1. Explain the main assumptions for linear regression models.
  2. How is Moran’s \(I\) used as a diagnostic in regression analysis?
  3. Residual spatial autocorrelation is symptomatic of what issues in regression analysis?
  4. What does it mean for a model to be linear in the coefficients?
  5. What is the purpose of transforming variables for regression analysis?

28.2 Learning objectives

In this activity, you will:

  1. Explore a spatial dataset.
  2. Conduct linear regression analysis.
  3. Conduct diagnostics for residual spatial autocorrelation.
  4. Propose ways to improve your analysis.

28.3 Suggested reading

O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 7. John Wiley & Sons: New Jersey.

28.4 Preliminaries

It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently on the workspace.

Load the libraries you will use in this activity. In addition to tidyverse, you will need sf, spdep, and isdas:

library(isdas)
library(tidyverse)
library(sf)
library(spdep)

Begin by loading the data files you will use in this activity:

data("HamiltonDAs")
data("trips_by_mode")
data("travel_time_car")

HamiltonDAs are the Dissemination Areas for Hamilton CMA, which coincide with the Traffic Analysis Zones (TAZ) of the Transportation Tomorrow Survey of 2011. The dataframe trips_by_mode includes the number of trips by mode of transportation by TAZ (equivalently DA), as well as other useful information from the 2011 census for Hamilton CMA. Finally, the dataframe travel_time_car includes the travel distance/time from TAZ/DA centroids to Jackson Square in downtown Hamilton.

The data for this activity were retrieved from the 2011 Transportation Tomorrow Survey (TTS), the periodic travel survey of the Greater Toronto and Hamilton Area, as well as from the 2011 Canadian Census (Census Program).

Before beginning the activity, join the information on trips and travel time to the sf object. Note that to complete the join, the identifier (in this case GTA06) must be in the same format in both data frames:

travel_time_car$GTA06 <- factor(travel_time_car$GTA06)

# Travel time
HamiltonDAs <- left_join(HamiltonDAs, travel_time_car, by = "GTA06")
# Trips by mode
HamiltonDAs <- left_join(HamiltonDAs, trips_by_mode, by = "GTA06")
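The need for matching identifier formats can be illustrated with a toy example. The two small data frames below are hypothetical (they are not the datasets in `isdas`), and the sketch uses base R's `merge()` with `all.x = TRUE` in place of `left_join()` so that it runs without additional packages.

```r
# Hypothetical zone table: the identifier is stored as a factor
zones <- data.frame(GTA06 = factor(c("5001", "5002", "5003")),
                    area_km2 = c(1.2, 0.8, 2.5))

# Hypothetical travel-time table: the identifier is stored as a number
times <- data.frame(GTA06 = c(5001, 5002, 5003),
                    minutes = c(12, 7, 21))

# Inspect the format of the key in both tables before joining
class(zones$GTA06)
class(times$GTA06)

# Convert the numeric identifier to a factor so that both keys share a format
times$GTA06 <- factor(times$GTA06)

# Keep every zone and attach the matching travel time (a left join)
joined <- merge(zones, times, by = "GTA06", all.x = TRUE)
joined
```

After a join, it is worth checking that the number of rows is unchanged and that the new columns do not contain unexpected missing values.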

The analysis will be based on travel by car in the Hamilton CMA. Calculate the proportion of trips by car by TAZ:

HamiltonDAs <- mutate(HamiltonDAs, Auto_driver.prop = Auto_driver / (Auto_driver + Cycle + Walk))

Note that trips made by car as a passenger are not included in the denominator of the proportion! This is because every trip as a passenger is already counted as part of a trip with a driver.

28.5 Activity

NOTE: Activities include technical “how to” tasks/questions. Usually, these ask you to practice using the software to organize data, create plots, and so on in support of analysis and interpretation. The second type of questions ask you to activate your brainware and to think geographically and statistically.

Activity Part I

  1. Examine your dataframe. What variables are included? Are there any missing values?

  2. Map the variable Auto_driver.prop, and use Moran’s I to test for spatial autocorrelation.

  3. Estimate a regression model using the variables Pop_Density and travel time in minutes.

Activity Part II

  1. What does the analysis of autocorrelation in point 2 tell you about Auto_driver.prop? Would you say that autocorrelation in this variable is a sign that autocorrelation will be an issue in regression analysis? Why or why not?

  2. Discuss the model you estimated in point 3. Next, examine its residuals. Would you say that they are spatially random/independent?

  3. Propose ways to improve your model.

29 Area Data VI

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

29.1 Learning Objectives

In the previous chapter, you practiced how to estimate linear regression models in R, learned about the use of Moran’s \(I\) as a diagnostic tool for regression residuals, and learned how to use local spatial statistics to support model-building. In this practice, you will:

  1. Revisit the notion of autocorrelation as a model diagnostic.
  2. Remedial action when the residuals are autocorrelated.
  3. Flexible functional forms and models with spatially-varying coefficients.
     3.1 Trend surface analysis.
     3.2 The expansion method.
     3.3 Geographically weighted regression (GWR).
  4. Spatial error model (SEM).

29.2 Suggested Readings

  • Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapter 7. Longman: Essex.
  • Bivand RS, Pebesma E, and Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 9. Springer: New York.
  • Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 7. Sage: Los Angeles.
  • O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 7. John Wiley & Sons: New Jersey.

29.3 Preliminaries

As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently on the workspace.

Load the libraries you will use in this activity:

library(isdas)
library(kableExtra)
library(plotly)
library(sf)
library(spatialreg)
library(spdep)
library(spgwr)
library(tidyverse)

Begin by loading the data needed for this chapter:

data("HamiltonDAs")

This is a simple features sf object with the Dissemination Areas in the Hamilton Census Metropolitan Area, in Canada, and it includes five simulated variables.

29.4 Residual spatial autocorrelation revisited

A key assumption about the residuals of a regression model is that they are random, which means that they should not display any systematic pattern. Previously you learned about the use of Moran’s \(I\) coefficient as a diagnostic in regression analysis. The residuals of a model can be mapped and examined for pattern, and Moran’s \(I\) used to test the hypothesis that they are spatially random. When we reject the null hypothesis and conclude that the residuals are not random, this is a symptom of a model that has not been properly specified.

Here, we will focus on two reasons for this that are of interest:

  1. The functional form is incorrect.
  2. The model failed to include relevant variables.

We will explore these in turn.

29.4.1 Incorrect Functional Form

As we saw in the preceding chapter, linear regression means that the model is linear in the parameters. However, life is not always linear, and an incorrect functional form can lead to residual spatial autocorrelation (McMillen 2003). To illustrate this, we will consider a spatial process as follows: \[ z = f(u,v) = exp(\beta_0)exp(\beta_1u)exp(\beta_2v) \]

where \(u\) and \(v\) are spatial coordinates. This is a non-linear spatial process, since the relationship between the coefficients and the outcome is not linear. We can simulate this process if we use the spatial coordinates of the dissemination areas in our example. The simulation is as follows, with a residual term with a mean of zero and standard deviation of 1. Notice that the residuals are random by design:

# The function `set.seed()` is used to fix the seed for the random number generator. This ensures that the simulation is replicable.
set.seed(10)

# Set the coefficients of the model for the simulations.
b0 = 1
b1 = 2
b2 = 4

# Retrieve the coordinates of the centroids of the dissemination areas.
uv_coords <- st_coordinates(st_centroid(HamiltonDAs))
## Warning in st_centroid.sf(HamiltonDAs): st_centroid assumes attributes are constant over geometries of x
# Add the coordinates of the centroids to the dataframe, but first transform them so that they have a false origin at the minimum values of u and v, and scaled by 100000. Simulate variable z using the coefficients defined above, the transformed coordinates, and add a random component with a mean of zero and standard deviation of one.
HamiltonDAs <- HamiltonDAs %>%
  mutate(u = (uv_coords[,1] - min(uv_coords[,1]))/100000,
         v = (uv_coords[,2] - min(uv_coords[,2]))/100000,
         z = exp(b0) * exp(b1 * u) * exp(b2 * v) +
           rnorm(n = 297, mean = 0, sd = 1))

This is the summary of the simulated variables:

HamiltonDAs %>% 
  select(u, v, z) %>%
  summary()
##        u                v                z                   geometry  
##  Min.   :0.0000   Min.   :0.0000   Min.   : 3.919   MULTIPOLYGON :297  
##  1st Qu.:0.2284   1st Qu.:0.1354   1st Qu.: 7.842   epsg:26917   :  0  
##  Median :0.2695   Median :0.1712   Median : 9.370   +proj=utm ...:  0  
##  Mean   :0.2724   Mean   :0.1863   Mean   :10.370                      
##  3rd Qu.:0.3127   3rd Qu.:0.2195   3rd Qu.:11.710                      
##  Max.   :0.5312   Max.   :0.4079   Max.   :22.809

Suppose that we estimate a linear regression that fails to capture the non-linearity of the process, because it is specified as linear in \(u\) and \(v\). The model would be as follows:

model1 <- lm(formula = z ~ u + v, data = HamiltonDAs) 
summary(model1)
## 
## Call:
## lm(formula = z ~ u + v, data = HamiltonDAs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7267 -0.8591  0.0028  0.8250  3.5826 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.6765     0.3255  -11.29   <2e-16 ***
## u            20.9207     0.8586   24.37   <2e-16 ***
## v            44.8033     0.9305   48.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.231 on 294 degrees of freedom
## Multiple R-squared:  0.8965, Adjusted R-squared:  0.8958 
## F-statistic:  1273 on 2 and 294 DF,  p-value: < 2.2e-16

At first glance, the model gives the impression of a very good fit: all coefficients are significant, and the coefficient of multiple determination \(R^2\) is high. However, at this point it is important to examine the residuals to verify that they are independent. We will add the residuals of this model to the dataframe for visualization:

# Copy the residuals of the model to the dataframe for mapping
HamiltonDAs$model1.e <- model1$residuals

A map of the residuals can help us to examine their spatial pattern (negative residuals are red, positive are blue):

# Create a `plotly` object with the dataframe and plot the simple features object with colors per the sign of the residuals (negative residuals in red, positive residuals in blue)
plot_ly(HamiltonDAs) %>%
  add_sf(color = ~ifelse(model1.e > 0, "Positive", "Negative"), colors = c("red", "dodgerblue4"))

Visual inspection of the spatial distribution of residuals is suggestive. Positive residuals mean that the model underestimates the values of the dependent variable, and negative that the model overestimates the values of the dependent variable. The model systematically underestimates the variable along a north-south band that crosses the center of the region, and overestimates systematically to the east and west. While it is quite clear that there is a systematic pattern in the residuals, it is important to support our visual inspection by testing for residual spatial autocorrelation.

To do this, we need to create a set of spatial weights:

HamiltonDAs.w <- HamiltonDAs %>%
  as("Spatial") %>%
  poly2nb() %>%
  nb2listw()

Once we have a set of spatial weights, we can proceed to calculate Moran’s \(I\):

moran.test(HamiltonDAs$model1.e, HamiltonDAs.w)
## 
##  Moran I test under randomisation
## 
## data:  HamiltonDAs$model1.e  
## weights: HamiltonDAs.w    
## 
## Moran I statistic standard deviate = 10.373, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.350300067      -0.003378378       0.001162633

Notice the very small \(p\)-value: this result means that we can comfortably reject the null hypothesis of spatial randomness; however, what we wish is the opposite, since we want the residuals to be spatially random! Thus, despite the apparent goodness of fit of the model, there is reason to believe something is amiss with the model (since we simulated it, we know that the problem is that the model should not be linear).
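For readers who want to see what `moran.test()` is measuring, the following sketch computes Moran’s \(I\) from first principles on a toy example in base R. The helper `moran_i()` and the five-zone chain are illustrative assumptions; they are not part of `spdep` and are not meant to replace it.

```r
# Hypothetical helper: Moran's I for a variable x and a spatial weights matrix W
moran_i <- function(x, W) {
  z <- x - mean(x)   # deviations from the mean
  n <- length(x)     # number of zones
  s0 <- sum(W)       # sum of all the weights
  (n / s0) * sum(W * outer(z, z)) / sum(z^2)
}

# Row-standardized weights for five zones arranged in a chain (1-2-3-4-5)
W <- matrix(c(0,   1,   0,   0,   0,
              0.5, 0,   0.5, 0,   0,
              0,   0.5, 0,   0.5, 0,
              0,   0,   0.5, 0,   0.5,
              0,   0,   0,   1,   0),
            nrow = 5, byrow = TRUE)

# A smooth gradient: neighbors have similar values
i_gradient <- moran_i(c(1, 2, 3, 4, 5), W)

# An alternating pattern: neighbors have dissimilar values
i_alternating <- moran_i(c(1, 5, 1, 5, 1), W)

# Expected value under the null hypothesis of spatial randomness
i_expected <- -1 / (5 - 1)

c(gradient = i_gradient, alternating = i_alternating, expected = i_expected)
```

The gradient produces a positive coefficient and the alternating pattern a negative one. Residuals from a well-specified model should produce a value close to the expectation under the null hypothesis.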

The results of testing for autocorrelation indicate an issue with the model, which in this case is an incorrect specification. This can be addressed with a variable transformation that approximates the underlying non-linear process. We can take the logarithm on both sides of the equation. On the left-hand side, we are left with \(log(z)\). On the right-hand side, the products become a sum, and the logarithms and exponentials cancel each other, to give: \[ log(z) = log\big(exp(\beta_0)exp(\beta_1u)exp(\beta_2v)\big) = log\big(exp(\beta_0)\big) + log\big(exp(\beta_1u)\big) + log\big(exp(\beta_2v)\big) = \beta_0 + \beta_1u + \beta_2v \]

which is now a linear model. This is called a log-transformation. The log-transformed model is estimated as follows:

model2 <- lm(formula = log(z) ~ u + v, data = HamiltonDAs)
summary(model2)
## 
## Call:
## lm(formula = log(z) ~ u + v, data = HamiltonDAs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32033 -0.06456  0.00671  0.07647  0.31233 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.96853    0.02864   33.81   <2e-16 ***
## u            2.08863    0.07554   27.65   <2e-16 ***
## v            3.97537    0.08187   48.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1083 on 294 degrees of freedom
## Multiple R-squared:  0.9016, Adjusted R-squared:  0.901 
## F-statistic:  1348 on 2 and 294 DF,  p-value: < 2.2e-16

This model does not necessarily have a better goodness of fit. However, when we test for spatial autocorrelation:

HamiltonDAs$model2.e <- model2$residuals
moran.test(HamiltonDAs$model2.e, HamiltonDAs.w)
## 
##  Moran I test under randomisation
## 
## data:  HamiltonDAs$model2.e  
## weights: HamiltonDAs.w    
## 
## Moran I statistic standard deviate = 0.59638, p-value = 0.2755
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.016946454      -0.003378378       0.001161482

Once the correct functional form has been specified, the model is better at capturing the underlying process (check how the coefficients closely approximate the true coefficients of the model). In addition, we can no longer reject the null hypothesis of spatial randomness (the \(p\)-value is 0.2755), so the residuals are consistent with a spatially random pattern, meaning that there is nothing left of the process but white noise.
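How closely the estimates approximate the true coefficients can be checked with a short calculation. The sketch below, in base R, uses the estimates and standard errors printed in the summary of model2 and the coefficients set for the simulation; the variable names are illustrative.

```r
# True coefficients used in the simulation
true_coef <- c(b0 = 1, b1 = 2, b2 = 4)

# Estimates and standard errors copied from the summary of model2 above
estimate  <- c(b0 = 0.96853, b1 = 2.08863, b2 = 3.97537)
std_error <- c(b0 = 0.02864, b1 = 0.07554, b2 = 0.08187)

# Distance between each estimate and the true value, in standard errors
t_vs_true <- (estimate - true_coef) / std_error
round(t_vs_true, 2)

# Are all the estimates within two standard errors of the true values?
all(abs(t_vs_true) < 2)
```

All three estimates are within two standard errors of the values used to simulate the data.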

29.4.2 Omitted Variables

Using the same example, suppose now that the functional form of the model is correctly specified, but a relevant variable is missing:

model3 <- lm(formula = log(z) ~ u, data = HamiltonDAs)
summary(model3)
## 
## Call:
## lm(formula = log(z) ~ u, data = HamiltonDAs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78563 -0.19306 -0.05461  0.14453  0.91857 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.90764    0.06334  30.118  < 2e-16 ***
## u            1.36012    0.22197   6.127 2.85e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3246 on 295 degrees of freedom
## Multiple R-squared:  0.1129, Adjusted R-squared:  0.1099 
## F-statistic: 37.54 on 1 and 295 DF,  p-value: 2.853e-09

As before, we will append the residuals to the dataframe:

HamiltonDAs$model3.e <- model3$residuals

We can plot a map of the residuals to examine their spatial pattern (negative residuals are red, positive are blue):

  plot_ly(HamiltonDAs) %>%
    add_sf(type = "scatter",
           color = ~ifelse(model3.e > 0, 
                           "Positive", 
                           "Negative"), 
           colors = c("red", 
                      "dodgerblue4"))

In this case, the visual inspection makes it clear that there is an issue with residual spatial pattern, and using Moran’s \(I\) we can conclude that the residuals are spatially autocorrelated:

moran.test(HamiltonDAs$model3.e, HamiltonDAs.w)
## 
##  Moran I test under randomisation
## 
## data:  HamiltonDAs$model3.e  
## weights: HamiltonDAs.w    
## 
## Moran I statistic standard deviate = 24.921, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.846098172      -0.003378378       0.001161895

As seen above, the model with the full set of relevant variables resolves this problem.

29.5 Remedial Action

When spatial autocorrelation is detected in the residuals, further work is warranted. The preceding examples illustrate two possible solutions to the issue of residual pattern:

  1. Modifications of the model to approximate the true functional form of the process; and
  2. Inclusion of relevant variables.

Ideally, we would try to ensure that the model is properly specified. In practice, however, it is not always evident what the functional form of the model should be. The search for an appropriate functional form can be guided by theoretical considerations, empirical findings, and experimentation. With respect to inclusion of relevant variables, it is not always possible to find all the information we desire. This could be because of limited resources, or because some aspects of the process are not known and therefore we do not even know what additional information should be collected.

In these cases, it is a fact that residual spatial autocorrelation is problematic.

Fortunately, a number of approaches have been proposed in the literature that can be used for remedial action.

In the following sections we will review some of them.

29.6 Flexible Functional Forms and Models with Spatially-varying Coefficients

Some models use variable transformations to create more flexible functions, while others use adaptive estimation strategies.

29.6.1 Trend Surface Analysis

Trend surface analysis is a simple way to generate relatively flexible regression models with surfaces that are not necessarily linear. This approach consists of using the coordinates as covariates, and transforming them into polynomials of different orders. Seen this way, linear regression is the analog of a trend surface of first degree: \[ z = f(u,v) = \beta_0 + \beta_1u + \beta_2v \] where again \(u\) and \(v\) are the coordinates.

The figure below illustrates how the function above creates a regression plane. To visualize it, we first need to create a grid of coordinates for plotting:

# The function `expand.grid()` takes two arguments and creates a dataframe from all the combinations of the values. The function `seq()` creates a vector with values starting at `from`, ending at `to`, with step increments given `by`. Here, we create a grid with values in `u` from -2 to 2 and values in `v` from -2 to 2.
df <- expand.grid(u = seq(from = -2, to = 2, by = 0.2), v = seq(from = -2, to = 2, by = 0.2))

Next, select some values for the coefficients (feel free to experiment with these values):

# Define some coefficients (you can change these values if you wish).
b0 <- 0.5 #0.5
b1 <- 1 #1
b2 <- 2 #2

# Create the regression plane. We did not add a random term here because this plane is the systematic component of the model!
z1 <- b0 + b1 * df$u + b2 * df$v
z1 <- matrix(z1, nrow = 21, ncol = 21)

The plot is as follows:

# Create a `plotly` object and add a surface.
plot_ly(z = ~z1) %>% 
  add_surface() %>%
  # The function `layout()` defines several aspects of the plot, in this case the axes titles and the labels for the ticks on the axes. Each axis is defined once, so that the title and the ticks are kept in the same list.
  layout(scene = list(xaxis = list(title = "v",
                                   ticktext = c("-2", "0", "2"), 
                                   tickvals = c(0, 10, 20)), 
                      yaxis = list(title = "u",
                                   ticktext = c("-2", "0", "2"), 
                                   tickvals = c(0, 10, 20))
                      )
         )

The figure above is a linear trend surface, and we can see that the dependent variable z grows as u and v grow.

Higher order trend surfaces can be defined as well. For example, a trend surface of second degree (or quadratic) would be as follows. Notice how it includes all possible quadratic terms, including the product \(uv\): \[ z = f(u,v) = \beta_0 + \beta_1u^2 + \beta_2u + \beta_3uv + \beta_4v + \beta_5v^2 \]

Now use the same grid as above to create a regression surface. Select some coefficients:

b0 <- 0.5 #0.5
b1 <- 2 #2
b2 <- 1 #1
b3 <- 1 #1
b4 <- 1.5 #1.5
b5 <- 0.5 #2.5
z2 <- b0 + b1 * df$u^2 + b2 * df$u + b3 * df$u * df$v + b4 * df$v + b5 * df$v^2
z2 <- matrix(z2, nrow = 21, ncol = 21)

And the plot is as follows:

plot_ly(z = ~z2) %>% add_surface() %>%
  layout(scene = list(xaxis = list(title = "v", ticktext = c("-2", "0", "2"), tickvals = c(0, 10, 20)), 
                      yaxis = list(title = "u", ticktext = c("-2", "0", "2"), tickvals = c(0, 10, 20))
                      )
         )
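
Since a trend surface is linear in its coefficients, it can be estimated by ordinary least squares. The following is a minimal sketch, not part of the original example: it simulates observations from the quadratic surface above with added noise and then recovers the coefficients with `lm()`. The object names `grid_df` and `fit_quad`, the seed, and the noise level are assumptions made for illustration only.

```r
# Grid of coordinates, built as in the example above
grid_df <- expand.grid(u = seq(from = -2, to = 2, by = 0.2),
                       v = seq(from = -2, to = 2, by = 0.2))

# Simulate the quadratic surface plus a random term (hypothetical noise level)
set.seed(123)
grid_df$z <- 0.5 + 2 * grid_df$u^2 + 1 * grid_df$u + 1 * grid_df$u * grid_df$v +
  1.5 * grid_df$v + 0.5 * grid_df$v^2 + rnorm(nrow(grid_df), sd = 0.5)

# `I()` protects the arithmetic so that powers and products are treated as covariates
fit_quad <- lm(z ~ I(u^2) + u + I(u * v) + v + I(v^2), data = grid_df)

# The estimates should be close to the values used in the simulation
round(coef(fit_quad), 2)
```

The formula lists the terms in the same order as the equation, so the estimated coefficients can be compared directly with \(\beta_0\) to \(\beta_5\).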

Higher order polynomials (i.e., cubic, quartic, etc.) are possible in principle. Something to keep in mind is that the higher the order of the polynomial, the more flexible the surface, which may lead to the following issues:

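  1. Multicollinearity.

Powers of variables tend to be highly correlated with each other. See the following table of correlations for the u coordinate in the example. Because the grid is symmetric about zero, odd and even powers are uncorrelated here, but the odd powers are highly correlated with each other, and so are the even powers:

         u    u^2    u^3    u^4
u     1.00   0.00   0.92   0.00
u^2   0.00   1.00   0.00   0.96
u^3   0.92   0.00   1.00   0.00
u^4   0.00   0.96   0.00   1.00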

When two variables are highly collinear, the model has difficulties discriminating their relative contribution to the model. This is manifested by inflated standard errors that may depress the significance of the coefficients, and occasionally by sign reversals.
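
Correlations of this kind are easy to check. The sketch below assumes that the table was computed on the values of `u` used for the grid in the example (from -2 to 2 in steps of 0.2); the object name `powers` is introduced here only for illustration.

```r
# Values of the u coordinate used for the grid
u <- seq(from = -2, to = 2, by = 0.2)

# Collect the first four powers of u as columns of a matrix
powers <- cbind(u = u, `u^2` = u^2, `u^3` = u^3, `u^4` = u^4)

# Correlation matrix of the powers, rounded to two decimal places
round(cor(powers), 2)
```

If the coordinates were not centered on zero (for example, projected coordinates in meters), all of the powers would tend to be highly correlated with each other, which is one reason why coordinates are often centered or rescaled before building a trend surface.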

  2. Overfitting.

Overfitting is another possible consequence of using a trend surface that is too flexible. This happens when a model fits the observations used for calibration too closely and, as a result, may fail to fit new information well.

To illustrate overfitting consider a simple example. Below we simulate a simple linear model with \(y_i = x_i + \epsilon_i\) (the random terms are drawn from the uniform distribution). We also simulate new data using the exact same process:

# Dataset for estimation
df.of1 <- data.frame(x = seq(from = 1, to = 10, by = 1))
df.of1 <- mutate(df.of1, y = x + runif(10, -1, 1))
# New data
new_data <- data.frame(x = seq(from = 1, to = 10, by = 0.5))
df.of2 <- mutate(new_data, y = x + runif(nrow(new_data), -1, 1))

This is the scatterplot of the observations in the estimation dataset:

p <- ggplot(data = df.of1, aes(x = x, y = y)) 
p + geom_point(size = 3)

A model with a first order trend (essentially linear regression), does not fit the observations perfectly, but when confronted with new data (plotted as red squares), it predicts them with reasonable accuracy:

mod.of1 <- lm(formula = y ~ x, data = df.of1)
pred1 <- predict(mod.of1, newdata = new_data) #mod.of1$fitted.values
p + geom_abline(slope = mod.of1$coefficients[2], intercept = mod.of1$coefficients[1], 
                color = "blue", size = 1) +
  geom_point(data = df.of2, aes(x = x, y = y), shape = 0, color = "red") +
  geom_segment(data = df.of2, aes(xend = x, yend = pred1)) + 
  geom_point(size = 3) +
  xlim(c(1, 10))

Compare to a polynomial of very high degree (nine in this case). The model is much more flexible, to the extent that it perfectly matches the observations in the estimation dataset. However, this flexibility has a major downside. When the model is confronted with new information, its performance is less satisfactory.

mod.of2 <- lm(formula = y ~ poly(x, degree = 9, raw = TRUE), data = df.of1)
poly.fun <- predict(mod.of2, data.frame(x = seq(1, 10, 0.1)))
pred2 <- predict(mod.of2, newdata = new_data) #mod.of1$fitted.values

p + geom_line(data = data.frame(x = seq(1, 10, 0.1), y = poly.fun), 
              aes(x = x, y = y),
              color = "blue", size = 1) + 
  geom_point(data = df.of2, 
             aes(x = x, y = y), 
             shape = 0, 
             color = "red") +
  geom_segment(data = df.of2, 
               aes(xend = x, yend = pred2)) + 
  geom_point(size = 3) +
  xlim(c(1, 10))

We can compute the root mean square (RMS), for each of the two models. The RMS is a measure of error that is calculated as the square root of the mean of the squared differences between two values (in this case the prediction of the model and the new information). This statistic is a measure of the typical deviation between two sets of values. Given new information, the RMS would tell us the expected size of the error when making a prediction using a given model.

The RMS for model 1 is:

sqrt(mean((df.of2$y - pred1)^2))
## [1] 0.525595

And for model 2:

sqrt(mean((df.of2$y - pred2)^2))
## [1] 1.681143

You will notice how model 2, despite fitting the estimation data better than model 1, typically produces larger errors when new information becomes available.
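
Since the same calculation is repeated for each model, it can be convenient to wrap it in a small helper. This is only a sketch; the function name `rms` and its argument names are assumptions introduced here, not part of any package used in this chapter.

```r
# Root mean square of the differences between observed and predicted values
rms <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# Toy check: the differences are 0, 0, and 2, so the result is sqrt(4 / 3)
rms(observed = c(1, 2, 5), predicted = c(1, 2, 3))
```

With the objects created above, `rms(df.of2$y, pred1)` and `rms(df.of2$y, pred2)` would reproduce the two values just reported.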

  3. Edge effects.

Another consequence of overfitting is that the resulting functions tend to display extreme behavior when taken outside of their estimation range, where the highest-order polynomial terms tend to dominate.

The plot below is the same high degree polynomial estimated above, just plotted in a slightly larger range of plus/minus one unit:

poly.fun <- predict(mod.of2, data.frame(x = seq(0, 11, 0.1)))
p + 
  geom_line(data = data.frame(x = seq(0, 11, 0.1), y = poly.fun), aes(x = x, y = y),
                color = "blue", size = 1) + 
  geom_point(data = df.of2, aes(x = x, y = y), shape = 0, color = "red") +
  geom_segment(data = df.of2, aes(xend = x, yend = pred2)) + 
  geom_point(size = 3)

29.6.2 Models with Spatially-varying Coefficients

Another way to generate flexible functional forms is by means of models with spatially varying coefficients. Two approaches are reviewed here.

29.6.2.1 Expansion Method

The expansion method (Casetti, 1972) is an approach to generate models with contextual effects. It follows a philosophy of specifying first a substantive model with variables of interest, and then an expanded model with contextual variables. In geographical analysis, typically the contextual variables are trend surfaces estimated using the coordinates of the observations.

To illustrate this, suppose that there is the following initial model of proportion of donors in a population, with two variables of substantive interest (say, income and education): \[ d_i = \beta_0(u_i,v_i) + \beta_1(u_i,v_i)I_i + \beta_2(u_i,v_i)Ed_i + \epsilon_i \]

Note how the coefficients are now a function of the coordinates at \(i\). Unlike previous models that had global coefficients, the coefficients in this model are allowed to adapt by location.

Unfortunately, it is not possible to estimate one coefficient per location. In this case, there are \(n\times k\) coefficients (with \(k = 3\) in this example), which exceeds the size of the sample (\(n\)). It is not possible to retrieve more than \(n\) parameters from a sample of size \(n\) (this is called the incidental parameter problem).

A possible solution is to specify a function for the coefficients, for instance, by specifying a trend surface for them: \[ \begin{array}{l} \beta_0(u_i, v_i) = \beta_{01} +\beta_{02}u_i + \beta_{03}v_i\\ \beta_1(u_i, v_i) = \beta_{11} +\beta_{12}u_i + \beta_{13}v_i\\ \beta_2(u_i, v_i) = \beta_{21} +\beta_{22}u_i + \beta_{23}v_i \end{array} \] By specifying the coefficients as a function of the coordinates, we allow them to vary by location.

Next, if we substitute these coefficients in the initial model, we arrive at a final expanded model: \[ d_i = \beta_{01} +\beta_{02}u_i + \beta_{03}v_i + \beta_{11}I_i +\beta_{12}u_iI_i + \beta_{13}v_iI_i + \beta_{21}Ed_i +\beta_{22}u_iEd_i + \beta_{23}v_iEd_i + \epsilon_i \]

This model now has nine coefficients, instead of \(n\times 3\), and can be estimated as usual.

It is important to note that since models generated based on the expansion method are based on the use of trend surfaces, similar caveats apply with respect to multicollinearity and overfitting.
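
In R, the expanded model can be written compactly with interaction terms. The sketch below uses simulated data, since the donors example is only described in the text: the dataframe `donors`, the variable names `inc` and `ed` (used in place of \(I\) and \(Ed\) to avoid a clash with R's `I()` function), and the coefficient values are all hypothetical.

```r
set.seed(42)
n <- 200

# Hypothetical data: coordinates, income, and education
donors <- data.frame(u = runif(n), v = runif(n),
                     inc = rnorm(n), ed = rnorm(n))

# Hypothetical process in which the effect of income drifts with u
donors$d <- 1 + (0.5 + 1.5 * donors$u) * donors$inc + 0.8 * donors$ed +
  rnorm(n, sd = 0.2)

# (inc + ed) * (u + v) expands to the nine terms of the final model:
# intercept, inc, ed, u, v, inc:u, inc:v, ed:u, ed:v
fit_exp <- lm(d ~ (inc + ed) * (u + v), data = donors)
names(coef(fit_exp))

# The spatially-varying coefficient of income is recovered by substitution
b <- coef(fit_exp)
donors$beta_inc <- b["inc"] + b["inc:u"] * donors$u + b["inc:v"] * donors$v
```

The last step mirrors the expansion equation for \(\beta_1(u_i, v_i)\): once the nine coefficients are estimated, a coefficient can be computed for every location and mapped.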

29.6.2.2 Geographically Weighted Regression (GWR)

A different strategy to estimate models with spatially-varying coefficients is a semi-parametric approach, called geographically weighted regression (see Brunsdon et al., 1996).

Instead of selecting a functional form for the coefficients as the expansion method does, the functions are left unspecified. The spatial variation of the coefficients results from an estimation strategy that takes subsamples of the data in a systematic way.

If you recall kernel density analysis, a kernel was a way of weighting observations based on their distance from a focal point.

Geographically weighted regression applies a similar concept, with a moving window that visits a focal point and estimates a weighted least squares model at that location. The results of the regression are conventionally applied to the focal point, in such a way that not only the coefficients are localized, but also every other regression diagnostic (e.g., the coefficient of determination, the standard deviation, etc.)

A key aspect of implementing this model is the selection of the kernel bandwidth, that is, the size of the window. If the window is too large, the local models tend towards the global model (estimated using the whole sample). If the window is too small, the model tends to overfit, since in the limit each window will contain only one, or a very small number of observations.

The kernel bandwidth can be selected if we define some loss function that we wish to minimize. A conventional approach (but not the only one), is to minimize a cross-validation score of the following form: \[ CV (\delta) = \sum_{i=1}^n{\big(y_i - \hat{y}_{\neq i}(\delta)\big)^2} \] In this notation, \(\delta\) is the bandwidth, and \(\hat{y}_{\neq i}(\delta)\) is the value of \(y\) predicted by a model with a bandwidth of \(\delta\) after excluding the observation at \(i\). This is called a leave-one-out cross-validation procedure, used to prevent the estimation from shrinking the bandwidth to zero.
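
To make the mechanics concrete, the following is a minimal base R sketch of the idea: a weighted least squares regression at each focal point, and the leave-one-out score computed for a few candidate bandwidths. It is not the implementation used by any package. The simulated data `toy`, the helper names `gauss_w`, `local_fit`, and `cv_score`, and the form of the Gaussian kernel, \(\exp(-0.5(d/\delta)^2)\), are assumptions made for illustration.

```r
set.seed(1)
n <- 100

# Hypothetical data in which the effect of v changes with u
toy <- data.frame(u = runif(n), v = runif(n))
toy$z <- 1 + (1 + 2 * toy$u) * toy$v + rnorm(n, sd = 0.1)

# Gaussian kernel: the weight decays with distance d from the focal point
gauss_w <- function(d, bw) exp(-0.5 * (d / bw)^2)

# Weighted least squares at focal observation i; optionally leave i out
local_fit <- function(i, bw, data, leave_out = FALSE) {
  d <- sqrt((data$u - data$u[i])^2 + (data$v - data$v[i])^2)
  w <- gauss_w(d, bw)
  if (leave_out) w[i] <- 0
  lm(z ~ u + v, data = data, weights = w)
}

# Leave-one-out cross-validation score for a given bandwidth
cv_score <- function(bw, data) {
  pred <- sapply(seq_len(nrow(data)), function(i) {
    predict(local_fit(i, bw, data, leave_out = TRUE), newdata = data[i, ])
  })
  sum((data$z - pred)^2)
}

# Compare a few candidate bandwidths; the smallest score is preferred
bws <- c(0.2, 0.5, 1, 5)
cv <- sapply(bws, cv_score, data = toy)
data.frame(bandwidth = bws, cv = cv)
```

Setting the weight of the focal observation to zero is what makes the score a leave-one-out measure. With a very large bandwidth all weights approach one, and `local_fit()` returns essentially the global regression.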

GWR is implemented in R in the package spgwr. To estimate models using this approach, the bandwidth is first selected with the function gwr.sel, which takes as inputs a formula specifying the dependent and independent variables, a SpatialPolygonsDataFrame (or a SpatialPointsDataFrame), and the kernel function (in the example below a Gaussian kernel). Since our data come in the form of simple features, we use as(x, "Spatial") to convert to a Spatial*DataFrame object:

delta <- gwr.sel(formula = z ~ u + v, 
                 data = as(HamiltonDAs, "Spatial"), 
                 gweight = gwr.Gauss)
## Bandwidth: 25621.66 CV score: 416.6583 
## Bandwidth: 41415.33 CV score: 439.9313 
## Bandwidth: 15860.64 CV score: 373.3401 
## Bandwidth: 9827.993 CV score: 326.3479 
## Bandwidth: 6099.614 CV score: 301.3906 
## Bandwidth: 3795.349 CV score: 307.3175 
## Bandwidth: 5784.775 CV score: 300.0247 
## Bandwidth: 5317.712 CV score: 298.6785 
## Bandwidth: 4736.221 CV score: 298.7873 
## Bandwidth: 5058.919 CV score: 298.4138 
## Bandwidth: 5051.908 CV score: 298.4127 
## Bandwidth: 5032.504 CV score: 298.4117 
## Bandwidth: 5034.856 CV score: 298.4117 
## Bandwidth: 5034.926 CV score: 298.4117 
## Bandwidth: 5034.918 CV score: 298.4117 
## Bandwidth: 5034.918 CV score: 298.4117 
## Bandwidth: 5034.918 CV score: 298.4117 
## Bandwidth: 5034.921 CV score: 298.4117 
## Bandwidth: 5034.919 CV score: 298.4117 
## Bandwidth: 5034.918 CV score: 298.4117 
## Bandwidth: 5034.918 CV score: 298.4117 
## Bandwidth: 5034.918 CV score: 298.4117

The function gwr estimates the suite of local models given a bandwidth:

model.gwr <- gwr(formula = z ~ u + v, 
                 bandwidth = delta, 
                 data = as(HamiltonDAs, "Spatial"),
                 gweight = gwr.Gauss)
model.gwr
## Call:
## gwr(formula = z ~ u + v, data = as(HamiltonDAs, "Spatial"), bandwidth = delta, 
##     gweight = gwr.Gauss)
## Kernel function: gwr.Gauss 
## Fixed bandwidth: 5034.918 
## Summary of GWR coefficient estimates at data points:
##                  Min.  1st Qu.   Median  3rd Qu.     Max.  Global
## X.Intercept. -16.8369  -5.8339  -2.0390  -0.6852   2.0016 -3.6765
## u              6.1497  16.5814  19.1775  24.9633  36.8438 20.9207
## v             22.3026  31.5813  36.8539  47.5515  84.3637 44.8033

The results are given for each location where a local regression was estimated. We can join these results to our sf dataframe for plotting:

HamiltonDAs$beta0 <- model.gwr$SDF@data$X.Intercept.
HamiltonDAs$beta1 <- model.gwr$SDF@data$u
HamiltonDAs$beta2 <- model.gwr$SDF@data$v
HamiltonDAs$localR2 <- model.gwr$SDF@data$localR2
HamiltonDAs$gwr.e <- model.gwr$SDF@data$gwr.e

The results can be mapped as shown below (try mapping beta1, beta2, localR2, or the residuals gwr.e):

ggplot(data = HamiltonDAs, aes(fill = beta0)) + 
  geom_sf(color = "white") +
  scale_fill_distiller(palette = "YlOrRd", trans = "reverse")

You can verify that the residuals are not spatially autocorrelated:

moran.test(HamiltonDAs$gwr.e, HamiltonDAs.w)
## 
##  Moran I test under randomisation
## 
## data:  HamiltonDAs$gwr.e  
## weights: HamiltonDAs.w    
## 
## Moran I statistic standard deviate = 0.016155, p-value = 0.4936
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      -0.002827158      -0.003378378       0.001164184

Some caveats with respect to GWR are in order.

Since estimation requires the selection of a kernel bandwidth, and selecting a bandwidth requires estimating leave-one-out regressions many times, GWR can be computationally demanding, especially for large datasets.

GWR has become a very popular method; however, there is conflicting evidence regarding its ability to retrieve a known spatial process (Paez, Farber, and Wheeler 2011). For this reason, the spatially-varying coefficients must be interpreted with a grain of salt. This seems to be less of a concern with larger samples, but at the moment it is not known how large a sample is safe (and larger samples also become computationally more demanding). As well, the estimation method is known to be sensitive to unusual observations (Farber and Páez 2007). At the moment, I recommend that GWR be used for prediction only, and in this respect it seems to perform as well as, or even better than, alternative approaches (Paez, Long, and Farber 2008).

29.7 Spatial Error Model (SEM)

A model that can be used to take direct remedial action with respect to residual spatial autocorrelation is the spatial error model.

This model is specified as follows: \[ y_i = \beta_0 + \sum_{j=1}^k{\beta_jx_{ij}} + \epsilon_i \]

However, it is no longer assumed that the residuals \(\epsilon\) are independent; instead, they display map pattern, in the shape of a moving average: \[ \epsilon_i = \lambda\sum_{j=1}^n{w_{ij}^{st}\epsilon_j} + \mu_i \]

A second set of residuals \(\mu\) is assumed to be independent.

It is possible to show that this model is no longer linear in the coefficients (but this would require a little bit of matrix algebra). For this reason, ordinary least squares is no longer an appropriate estimation algorithm, and models of this kind are instead usually estimated based on a method called maximum likelihood [which we will not cover in detail here; you can read about it in Anselin (1988)].
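
The matrix algebra involved is brief. Stacking the residuals in vectors and the weights in a matrix \(W\), the error process is \(\epsilon = \lambda W\epsilon + \mu\), so that \((I - \lambda W)\epsilon = \mu\) and therefore \(\epsilon = (I - \lambda W)^{-1}\mu\). The following sketch simulates residuals with this structure. The ring of six zones, the row-standardized weights, and the value of `lambda` are assumptions chosen only to keep the example small.

```r
set.seed(7)
n <- 6

# Hypothetical ring of six zones: each zone neighbors the previous and the next,
# with row-standardized weights (each row sums to one)
W <- matrix(0, nrow = n, ncol = n)
for (i in 1:n) {
  W[i, ifelse(i == 1, n, i - 1)] <- 0.5
  W[i, ifelse(i == n, 1, i + 1)] <- 0.5
}

lambda <- 0.6
mu <- rnorm(n)  # independent residuals

# Solve (I - lambda * W) eps = mu for the spatially autocorrelated residuals
eps <- solve(diag(n) - lambda * W, mu)

# Check: each eps_i is lambda times the mean of its neighbors plus mu_i
all.equal(as.vector(lambda * W %*% eps + mu), eps)
```

Because \(\lambda\) appears inside the inverse, the likelihood of the model is not a linear function of the parameters, which is why a method other than ordinary least squares is needed.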

Spatial error models are implemented in the package spatialreg.

As a remedial model, it can account for a model with a misspecified functional form. We know that the underlying process is not linear, but we specify a linear relationship between the covariates in the form of \(z = \beta_0 + \beta_1u + \beta_2v\):

model.sem1 <- errorsarlm(formula = z ~ u + v, 
                        data = HamiltonDAs, 
                        listw = HamiltonDAs.w)
summary(model.sem1)
## 
## Call:errorsarlm(formula = z ~ u + v, data = HamiltonDAs, listw = HamiltonDAs.w)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -2.801195 -0.845856  0.054448  0.793607  2.753617 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##             Estimate Std. Error z value  Pr(>|z|)
## (Intercept) -3.89916    0.63027 -6.1865 6.151e-10
## u           20.99256    1.66815 12.5844 < 2.2e-16
## v           45.92072    1.80719 25.4100 < 2.2e-16
## 
## Lambda: 0.5839, LR test value: 70.68, p-value: < 2.22e-16
## Asymptotic standard error: 0.063578
##     z-value: 9.184, p-value: < 2.22e-16
## Wald statistic: 84.345, p-value: < 2.22e-16
## 
## Log likelihood: -446.2198 for error model
## ML residual variance (sigma squared): 1.0996, (sigma: 1.0486)
## Number of observations: 297 
## Number of parameters estimated: 5 
## AIC: 902.44, (AIC for lm: 971.12)

The coefficient \(\lambda\) is positive (indicative of positive autocorrelation) and high, since close to 60% of the moving average of the residuals \(\epsilon\) in the neighborhood of \(i\) contributes to the value of \(\epsilon_i\).

You can verify that the residuals are spatially uncorrelated (note that the alternative is “less” because of the negative sign of Moran’s \(I\) coefficient):

moran.test(model.sem1$residuals, HamiltonDAs.w, alternative = "less")
## 
##  Moran I test under randomisation
## 
## data:  model.sem1$residuals  
## weights: HamiltonDAs.w    
## 
## Moran I statistic standard deviate = -0.99147, p-value = 0.1607
## alternative hypothesis: less
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      -0.037200700      -0.003378378       0.001163727

Now consider the case of a missing covariate:

model.sem2 <- errorsarlm(formula = log(z) ~ u, 
                        data = HamiltonDAs, 
                        listw = HamiltonDAs.w)
summary(model.sem2)
## 
## Call:errorsarlm(formula = log(z) ~ u, data = HamiltonDAs, listw = HamiltonDAs.w)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.4528582 -0.0706124  0.0077446  0.0831516  0.4621741 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##             Estimate Std. Error z value  Pr(>|z|)
## (Intercept)  1.75329    0.20266  8.6512 < 2.2e-16
## u            1.89674    0.65840  2.8808  0.003966
## 
## Lambda: 0.92272, LR test value: 492.33, p-value: < 2.22e-16
## Asymptotic standard error: 0.021523
##     z-value: 42.87, p-value: < 2.22e-16
## Wald statistic: 1837.9, p-value: < 2.22e-16
## 
## Log likelihood: 159.879 for error model
## ML residual variance (sigma squared): 0.015466, (sigma: 0.12436)
## Number of observations: 297 
## Number of parameters estimated: 4 
## AIC: -311.76, (AIC for lm: 178.57)

In this case, the residual pattern is particularly strong, with more than 90% of the moving average contributing to the residuals. Alas, in this case, the remedial action falls short of cleaning the residuals, and we can see that they remain spatially autocorrelated (now negatively):

moran.test(model.sem2$residuals, HamiltonDAs.w, alternative = "less")
## 
##  Moran I test under randomisation
## 
## data:  model.sem2$residuals  
## weights: HamiltonDAs.w    
## 
## Moran I statistic standard deviate = -3.3739, p-value = 0.0003705
## alternative hypothesis: less
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      -0.118141097      -0.003378378       0.001156981

This would suggest the need for alternative action (such as the search for additional covariates).

Ideally, a model should be well-specified, and remedial action should be undertaken only when other alternatives have been exhausted.

30 Activity 14: Area Data VI

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

30.1 Practice questions

Answer the following questions:

  1. Describe and discuss the possible sources of autocorrelation in the residuals of a model.
  2. List possible corrective/remedial actions when residual autocorrelation is detected.
  3. Under which situations is a Spatial Error Model an adequate modeling strategy?

30.2 Learning objectives

In this activity, you will:

  1. Explore a dataset with area data using visualization as appropriate.
  2. Discuss a process that might explain any pattern observed from the data.
  3. Conduct a modeling exercise using appropriate techniques. Justify your modeling decisions.

30.3 Suggested reading

O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 5. John Wiley & Sons: New Jersey.

30.4 Preliminaries

It is good practice to clear the workspace to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently on the workspace.

Load the libraries you will use in this activity (load other packages as appropriate).

library(isdas)
library(sf)
library(spatstat)
library(spdep)
library(tidyverse)

Choose a data set with area data that interests you. These are two possibilities:

30.4.1 New York leukemia data

data("nyleukemia")

A SpatialPolygonsDataFrame that contains the following variables:

  • AREANAME name of census tract
  • AREAKEY unique FIPS code for each tract
  • POP8 population size (1980 U.S. Census)
  • TRACTCAS number of cases of leukemia (1978-1982)
  • PROPCAS proportion of cases per tract
  • PCTOWNHOME percentage of people in each tract owning their own home
  • PCTAGE65P percentage of people in each tract aged 65 or more
  • Z transformed proportions
  • AVGIDIST average distance between centroid and TCE sites
  • PEXPOSURE “exposure potential”: inverse distance between each census tract centroid and the nearest TCE site, IDIST, transformed via log(100*IDIST)

This can be converted to a simple features object as follows:

nyleukemia.sf <- st_as_sf(nyleukemia)

30.4.2 Pennsylvania lung cancer

data("pennlc")

A SpatialPolygonsDataFrame that contains the following variables:

  • county: Name of the county
  • cases: Number of cases of lung cancer
  • population: Population by county
  • rate: Lung cancer rate by county
  • smoking: Smoking rate by county
  • cancer_rate: Lung cancer rate by county (%)
  • smoking_rate: Smoking rate by county (%)

This can be converted to a simple features object as follows:

pennlc.sf <- st_as_sf(pennlc)

30.5 Activity

Capstone Activity

This is a capstone activity where you can work free-style on a data set of your choice, and put in practice what you have learned with respect to the analysis of area data.

  1. Partner with a fellow student to analyze the chosen dataset.

  2. Visualize/explore the dataset using appropriate tools.

  3. Analyze your dataset by means of regression modeling. Which should be the dependent variable in your dataset? Why?

  4. Discuss the results of your analysis, including possible limitations, and possible ways to improve it (e.g., what additional variables would you like to use?)

(PART) Part V: Analysis and Prediction of Fields

31 Spatially Continuous Data I

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

In this chapter you will use a custom function that is included in the package isdas as follows:

kpointmeans(source_xy, target_xy, z, k, latlong)

This is a function to calculate \(k\)-point means. It takes the following arguments: a simple features object (source_xy) with the coordinates of the observations to be used for interpolation, which should also contain the variable to interpolate as a column; a simple features object with the points where we wish to interpolate the variable (target_xy); the variable z to interpolate; the number of nearest neighbors k; and a logical value to indicate whether the coordinates are latitude-longitude (the default is FALSE).
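
To give a sense of what such a function does, here is a minimal base R sketch. It is not the implementation in isdas: it assumes that a \(k\)-point mean is the unweighted average of the \(k\) nearest observations, it works with plain dataframes rather than simple features objects, and it uses planar (not latitude-longitude) distances. The function name `k_point_means` and the toy data are hypothetical.

```r
# Average of the k nearest source observations for each target point
k_point_means <- function(source, target, k) {
  sapply(seq_len(nrow(target)), function(i) {
    # Planar distance from target point i to every source observation
    d <- sqrt((source$x - target$x[i])^2 + (source$y - target$y[i])^2)
    # Mean of z over the k closest observations
    mean(source$z[order(d)[1:k]])
  })
}

# Toy data: four observations at the corners of a unit square
src <- data.frame(x = c(0, 1, 0, 1), y = c(0, 0, 1, 1), z = c(10, 20, 30, 40))
tgt <- data.frame(x = c(0.1, 0.9), y = c(0.2, 0.2))

k_point_means(src, tgt, k = 2)
## [1] 20 30
```

The first target point is closest to the two observations on the left edge of the square (values 10 and 30), and the second to the two on the right edge (values 20 and 40), hence the means of 20 and 30.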

31.1 Learning objectives

Previously, you learned about the analysis of area data. Starting with this practice, you will be introduced to another type of spatial data: continuous data, also called fields. In this practice, you will learn:

  1. About spatially continuous data/fields.
  2. Exploratory visualization.
  3. The purpose of spatial interpolation.
  4. The use of tile-based approaches.
  5. Inverse distance weighting.
  6. K-point means.

31.2 Suggested readings

  • Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapters 5 and 6. Longman: Essex.
  • Bivand RS, Pebesma E, and Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 8. Springer: New York.
  • Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 6, Sections 6.7 and 6.8. Sage: Los Angeles.
  • Isaaks EH and Srivastava RM (1989) An Introduction to Applied Geostatistics, Chapter 4. Oxford University Press: Oxford.
  • O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapters 9 and 10. John Wiley & Sons: New Jersey.

31.3 Preliminaries

As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently on the workspace.

Load the libraries you will use in this activity:

library(deldir)
library(isdas)
library(plotly)
library(spatstat)
library(spdep)
library(tidyverse)

Begin by loading the data you will need for this Chapter:

data("Walker_Lake")

You can verify the contents of the dataframe:

summary(Walker_Lake)
##       ID                  X               Y               V                U           T      
##  Length:470         Min.   :  8.0   Min.   :  8.0   Min.   :   0.0   Min.   :   0.00   1: 45  
##  Class :character   1st Qu.: 51.0   1st Qu.: 80.0   1st Qu.: 182.0   1st Qu.:  83.95   2:425  
##  Mode  :character   Median : 89.0   Median :139.5   Median : 425.2   Median : 335.00          
##                     Mean   :111.1   Mean   :141.3   Mean   : 435.4   Mean   : 613.27          
##                     3rd Qu.:170.0   3rd Qu.:208.0   3rd Qu.: 644.4   3rd Qu.: 883.20          
##                     Max.   :251.0   Max.   :291.0   Max.   :1528.1   Max.   :5190.10          
##                                                                      NA's   :195

This dataframe includes a sample of geocoded observations with false coordinates X and Y, two quantitative variables V and U, and a factor variable T. The variables are generic, but you can think of them as measurements of pollutants. The Walker Lake dataset was originally used for teaching geostatistics in Isaaks and Srivastava’s (1989) book An Introduction to Applied Geostatistics.

31.4 Spatially continuous (field) data

Previously in the book we discussed two types of data that are of interest in spatial analysis: points and events, and areas.

The last section of the course will deal with a third type of data that finds numerous applications in many disciplines.

We will begin by recalling that there are different units of support for spatial data. The unit of support is the type of spatial object that is used to represent a spatial phenomenon, and that is useful to understand the kind of process and the types of analysis that can be applied.

In the case of point pattern analysis, the unit of support is the point. Depending on the scale of the analysis, the point could be anything from the centroid of cells, the location of trees, the addresses of businesses, or the centers of cities at a much larger scale. Obviously, none of these objects are actual points (the point is a theoretical object). However, points are a reasonable representation for events when their size is minuscule compared to the area of the region under analysis. The most basic attribute of an event is whether it is present (e.g., is there a tree at this location?). Other attributes are conditional on that one.

In the case of areas, the unit of support is a zone. Data in this type of analysis may or may not be generated by a discontinuous process, but once they have been cast in the form of statistics for areas, this will usually involve discontinuities at the edges of the areas.

An important difference between point pattern analysis and analysis of data in areas is the source of the randomness.

In the case of point pattern analysis, the coordinates of the event are assumed to be the outcome of a random process. In area data, the locations of the data are exogenously given, and the source of randomness instead is in the values of the attributes.

This brings us to spatially continuous data.

Superficially, spatially continuous data looks like points. This is because of how a field is measured at discrete locations. The underlying process, however, is not discrete, and a field can in principle be measured at any location in the space where the underlying phenomenon is in operation. Some obvious examples of this include temperature and elevation. Temperature is measured at discrete locations, but the phenomenon itself is extensive. Same thing with elevation.

The source of randomness in the case of fields is the inherent uncertainty of the outcome of the process at locations where it was not measured. Therefore, an essential task is to predict values at unmeasured locations. We call this task spatial interpolation. In addition, we often are interested in assessing the degree of uncertainty of predictions or interpolated values.

The study of continuous data has been heavily influenced by the work of South African mining engineer D.G. Krige, who sought to estimate the distribution of minerals based on samples of boreholes. Since then, the study of fields has found applications in remote sensing, real estate appraisal, environmental science, hydrogeology, and many other disciplines.

We will define a field as a mixed spatial process that depends on the coordinates \(u_i\) and \(v_i\), in addition to a vector of covariates \(\bf{x}_i\): \[ z_i = f(u_i, v_i, \bf{x}_i) + \epsilon_i \] where \(i\) is an arbitrary location in the region, and \(\epsilon_i\) is the difference between the systematic description of the process (i.e., \(f(u_i, v_i, \bf{x}_i)\)) and the value of the field \(z\).

More simply, a field could be the outcome of a purely spatial process as follows: \[ z_i = f(u_i, v_i) + \epsilon_i \]
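As a rough illustration of the purely spatial process above, the following sketch simulates a field in base R. The function f_field, the noise level, and the sample size are hypothetical choices made only for this example; they are not related to the Walker Lake data.

```r
# A minimal sketch of a purely spatial process z_i = f(u_i, v_i) + e_i.
# The systematic part `f_field()` is a hypothetical choice for illustration.
f_field <- function(u, v) {
  10 + 2 * u - 0.5 * v
}

set.seed(123)
n <- 50
# Locations where the field is "measured"
u <- runif(n, min = 0, max = 10)
v <- runif(n, min = 0, max = 10)
# Random term (assumed here to be independent and normally distributed)
e <- rnorm(n, mean = 0, sd = 1)
# Observed values of the field at the measured locations
z <- f_field(u, v) + e

sim_field <- data.frame(u = u, v = v, z = z)
head(sim_field)
```

In practice neither the systematic part nor the random term is known; only the coordinates and the values of z are observed.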

The value of a field is known at the locations \(i\) where it is measured. In locations where the field was not measured (we will call any such location \(p\)), there will be some uncertainty about the value of the field variable, which stems from our limited knowledge of the underlying process. As a consequence, there will be a random term associated with any prediction of the value of the field: \[ \hat{z}_p = \hat{f}(u_p, v_p) + \hat{\epsilon}_p \] We use the hat notation to indicate that these are estimates of the true values.

A key task in the analysis of fields is to determine a suitable function for making predictions \(\hat{z}_p\) and to estimate the uncertainty as well.

In this and upcoming sessions you will learn about methods to achieve this task.

31.5 Exploratory visualization

We will begin our discussion of fields with techniques for exploratory visualization. The methods are very similar to those used for marked point patterns in point pattern analysis. Like we did there, we can use dot or proportional symbol maps. For example, we can create a proportional symbol map of the variable V in the Walker Lake dataset (with alpha = 0.5 for some transparency to mitigate the overplotting):

ps1 <- ggplot(data = Walker_Lake, 
              aes (x = X, y = Y, color = V, size = V)) +
  # `alpha` is used to control the transparency of the objects,
  # with 1 being completely opaque and 0 completely transparent.
  geom_point(alpha = 0.5) + 
  scale_color_distiller(palette = "OrRd",
                        direction = 1) +
  # `coord_equal()` ensures that units in the x and y axes are displayed using 
  # the same scale (aspect ratio 1:1)
  coord_equal() 

ps1

The proportional symbols indicate the location where a measurement was made. There is no randomness in these locations, as they were selected by design. In particular, notice how a regular grid seems to have been used for part of the sampling, and then possibly there was further infill sampling at those places where the field appeared to vary more.

Imagine that the observations are of a contaminant. The task could be to calculate the total amount of the contaminant over the region. This would require you to obtain estimates of the contaminant in all the region, not just those places where measurements were made. If, as is typically the case, making more observations is expensive, other approaches must be adopted.

Before proceeding, remember that the package plotly can be used to enhance exploratory analysis by allowing user interactivity. Below is the same plot as before, but now as an interactive 3D scatterplot:

plot_ly(data = Walker_Lake,
               x = ~X, 
               y = ~Y, 
               z = ~V,
               marker = list(color = ~V, 
                             colorscale = c("Orange", "Red"), 
                             showscale = TRUE)) %>% 
  add_markers() #adding traces to a plotly visualization

31.6 Tile-based methods

Another common approach to visualize fields is by means of tile-based methods. These methods take a set of points and convert them into a tessellation, thus giving them the aspect of “tiles”.

A widely used method to convert points into tiles is called Voronoi polygons, after Georgy Voronoi, the mathematician who discovered it. To illustrate how Voronoi polygons are created, we will use a simple example.

  1. Given a set of generating points \(p_g\) with coordinates \((u_g, v_g)\) (for \(g = 1,...,n\)) and values of a variable \(z_{p_g}\):
# Create a set of coordinates for the example: a matrix with
# five rows (points) and two columns (coordinates)
uv_coords <- matrix(c(0.7, 5.2, 3.3, 1.3, 5.4, 0.5, 1.8, 2.3, 4.8, 5.5), 
                    nrow = 5, 
                    ncol = 2) %>% 
  st_multipoint("XY")

# Create a window for the points, this is similar to the windows 
# used in `spatstat` for spatial point pattern analysis
box = st_polygon(list(rbind(c(0,0),c(6,0),c(6,6),c(0,6),c(0,0))))

# Create a plot of the coordinates and the window
p <- ggplot(data = uv_coords) + 
  geom_sf(size = 2) +
  geom_sf(data = box, 
          fill = NA)
p

  2. Each point is connected by means of straight lines to its two nearest neighbors to create a triangulation:
# Create a triangulation that connects each point to its two 
# nearest neighbors. The function `st_triangulate()` from 
# the `sf` package does this. The output can be polygons 
# (triangles) or lines only. We set bOnlyEdges = TRUE to 
# obtain only the lines.
l2n <- st_triangulate(uv_coords, bOnlyEdges = TRUE)

# Plot the triangulation, i.e., the lines between nearest neighbors
ggplot(data = uv_coords) + 
  geom_sf(size = 2) +
  geom_sf(data = box, 
          fill = NA) +
  geom_sf(data = l2n, 
          color = "gray", 
          linetype = "dashed")

Notice that the plot above is already a tessellation, with the points at the vertices of the triangles.

  3. The perpendicular bisectors of the sides of each triangle are found and extended until they intersect. The resulting tessellation is a set of Voronoi polygons:
# The function `st_voronoi()` from the `sf` package is used 
# to create Voronoi polygons based on points
vor <- st_voronoi(uv_coords) 

ggplot(data = uv_coords) +
  geom_sf(size = 2) + 
  geom_sf(data = l2n, 
          color = "gray", 
          linetype = "dashed") + 
  geom_sf(data = vor, 
          fill = NA) + 
  coord_sf(xlim = c(0, 6), 
           ylim = c(0, 6))

The triangulation was used to generate a second tessellation, i.e., the Voronoi polygons. These polygons have the property that any point \(p_i\) inside the polygon with generating point \(p_g\) in it, is closer to \(p_g\) than to any other generating point \(p_k\) on the plane. For this reason, Voronoi polygons are used to obtain areas of influence, among other applications.

There are other ways of obtaining Voronoi polygons, as Figure 1 below illustrates. Voronoi polygons in the figure are created by radial growth. The basic concept is the same, but implemented in a different way: find every point that is closest to \(p_g\). When two circles touch, they become the boundary between all points that are closer to \(p_g\) and \(p_k\) respectively. Continue growing until the plane is fully covered.

Figure 1. Voronoi polygons by radial growth

The Voronoi polygons for the sample data set can be obtained in R as follows.

First, we will convert the Walker_Lake dataframe to a simple features object using the function st_as_sf as follows:

# Function `st_as_sf()` takes a foreign object 
# (foreign to the `sf` package) and converts it 
# into a simple features object. If the foreign 
# object is points, the coordinates can be named 
# by means of the argument `coords`. 
Walker_Lake.sf <- Walker_Lake %>% 
  st_as_sf(coords = c("X", "Y"))

Once we have an sf object of the points, the geometry can be used to create the Voronoi polygons:

# The function `do.call(what, arg)` applies a function
# `what` to the argument `arg`. In this case, we extract 
# the geometry of the `sf` object (i.e., the coordinates 
# of the points) and apply the function `c()` to concatenate 
# the coordinates to obtain a MULTIPOINT object.   
# The pipe operator passes the MULTIPOINT object to function `st_voronoi()`
vpolygons <- do.call(c, st_geometry(Walker_Lake.sf)) %>% 
  st_voronoi() %>% 
  # The output of `st_voronoi()` is a collection of geometries, 
  # which we pass to the following function for extraction.
  st_collection_extract()
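As a small aside on do.call, the sketch below uses plain numeric vectors (made up for the example) instead of point geometries. Calling do.call with c and a list has the same effect as passing every element of the list to c in a single call.

```r
# A list of separate pieces, standing in for the list of point geometries
pieces <- list(c(1, 2), c(3, 4), 5)

# `do.call()` calls `c()` once, with every element of the list as an argument
combined <- do.call(c, pieces)
combined

# This is the same as writing the call by hand
identical(combined, c(c(1, 2), c(3, 4), 5))
```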

After the step above we already have the Voronoi polygons:

ggplot(vpolygons) + 
  geom_sf(fill = NA)

However, these polygons are just the geometry and lack other attributes that we originally had for the points. See:

head(vpolygons)
## Geometry set for 6 features 
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -275 ymin: -19.79545 xmax: 39.35 ymax: 232.9318
## CRS:           NA
## First 5 geometries:
## POLYGON ((-275 109.5, -180.5 109.5, 10.8913 99....
## POLYGON ((-275 91.88636, 4.207317 79.19512, 14....
## POLYGON ((-275 49, -171.5 49, 18.9248 38.42084,...
## POLYGON ((20.40385 52.61538, 18.36567 50.35075,...
## POLYGON ((18.36567 50.35075, 20.40385 52.61538,...

For this reason, we need to join the Voronoi polygons to the attributes of the points. To do this, we will first copy the sf object with the original points to a new dataframe, and then replace the geometry of the points with the geometry of the polygons:

Walker_Lake.v <- Walker_Lake.sf
Walker_Lake.v$geometry <- vpolygons[unlist(st_intersects(Walker_Lake.sf, vpolygons))] 

The new Walker_Lake.v object now includes the attributes of the original points as well as the geometry of the polygons:

head(Walker_Lake.v)
## Simple feature collection with 6 features and 4 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -275 ymin: -275 xmax: 35.5 ymax: 120
## CRS:           NA
## # A tibble: 6 × 5
##   ID        V     U T                                                                                         geometry
##   <chr> <dbl> <dbl> <fct>                                                                                    <POLYGON>
## 1 1        0     NA 2     ((-275 -275, -275 -19.79545, 18.05556 20.16667, 19.59836 18.84426, 35.5 -87.16667, 35.5 -...
## 2 2        0     NA 2             ((-275 49, -171.5 49, 18.9248 38.42084, 18.05556 20.16667, -275 -19.79545, -275 49))
## 3 3      224.    NA 2     ((-171.5 49, 8.1875 57.98438, 18.36567 50.35075, 20.06057 39.61639, 18.9248 38.42084, -17...
## 4 4      434.    NA 2     ((-275 91.88636, 4.207317 79.19512, 14.18919 72.54054, 14.68421 66.10526, 8.1875 57.98438...
## 5 5      412.    NA 2     ((-275 109.5, -180.5 109.5, 10.8913 99.93043, 14.28571 95.85714, 10.1087 83.32609, 4.2073...
## 6 6      587.    NA 2                    ((-180.5 109.5, 19 120, 20 119.1, 20 109.95, 10.8913 99.93043, -180.5 109.5))

We can now plot the attributes using the polygons. The value of \(z\) for a tile is the same as the value of the variable for its corresponding generating point, or \(z_{p_g}\). This is the plot for the current example:

ggplot(Walker_Lake.v) + 
  geom_sf(aes(fill = V)) +
  scale_fill_distiller(palette = "OrRd", 
                       direction = 1)

We can see that the Voronoi polygons extend well beyond the extent of the original points, and in the plot add a large amount of unnecessary area. We can improve the plot in two ways, by limiting the extent for plotting, or by clipping the polygons. Here we will try the latter, with a bounding box that covers the region of interest:

# Function `st_polygon()` creates an `sf` object with a 
# polygon or polygons. In this case, we create a single 
# polygon, a rectangle with corners given by the coordinates 
# in the function. 
W.bbox <- st_polygon(list(rbind(c(0,0),
                                c(259,0),
                                c(259, 299),
                                c(0, 299),
                                c(0,0))))

The intersection of the polygons with the box clips the polygons:

Walker_Lake.v <- Walker_Lake.v %>%
  st_intersection(W.bbox)
## Warning: attribute variables are assumed to be spatially constant throughout all geometries

This is the plot after fixing this issue:

ggplot(Walker_Lake.v) + 
  geom_sf(aes(fill = V)) + 
  geom_sf(data = Walker_Lake.sf,
          size = 0.1) +
  scale_fill_distiller(palette = "OrRd", 
                       direction = 1)

As you can see, the points in the sample have been converted to a surface from which the value of \(z\) can be estimated at any point as desired, from the value of \(z\) of the closest point used to generate the tiles. This can be expressed as follows: \[ \hat{z}_p = z_{p_g}\text{ for } p_g\text{ with } d_{pp_g}<d_{pp_k}\forall{k} \]
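The prediction rule above can be sketched in base R without building the polygons at all, since it only requires finding the closest observation. The helper voronoi_predict and the toy observations below are hypothetical and serve only to illustrate the rule; they are not part of the sf package or of the Walker Lake data.

```r
# Sketch of prediction by Voronoi polygons: the predicted value at p is the
# value observed at the closest generating point.
# `voronoi_predict()` is a hypothetical helper, not a function from `sf`.
voronoi_predict <- function(u, v, z, u_p, v_p) {
  # Euclidean distances from the target location p to every observation
  d <- sqrt((u - u_p)^2 + (v - v_p)^2)
  # Value of the closest observation (ties go to the first one found)
  z[which.min(d)]
}

# Toy observations (made up for the example)
u <- c(1, 4, 5)
v <- c(1, 1, 5)
z <- c(10, 20, 30)

# This location is closest to the first observation
voronoi_predict(u, v, z, u_p = 1.5, v_p = 1.2)
# This location is closest to the third observation
voronoi_predict(u, v, z, u_p = 4.8, v_p = 4.0)
```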

31.7 Inverse distance weighting (IDW)

The tile-based approach above assumes that the field is flat within each polygon (see Figure 2). This is in most cases an unrealistic assumption. Other approaches to interpolate a spatial variable allow the estimated value of \(z_p\) to vary with proximity to observations. Such is the case of IDW.

Figure 2. A field according to Voronoi polygons

Inverse distance weighting takes the following form: \[ \hat{z}_p = \frac{\sum_{i=1}^n{w_{pi}z_i}}{\sum_{i=1}^n{w_{pi}}} \]

This will probably look familiar to you, because it is formally identical to the spatial moving average. The difference is in how the “spatial weights” \(w_{pi}\) are defined. For IDW, the spatial weights are given by a function of the inverse power of distance, as follows: \[ w_{pi} = \frac{1}{d_{pi}^\gamma} \] In the expression above, parameter \(\gamma\) controls the steepness of the decay function, with smaller values giving greater weight to more distant locations. Large values of \(\gamma\) converge to a 1-point average (so that the interpolated value is identical to the nearest observation; you can verify this).
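One way to verify this numerically is the following sketch, which computes the IDW prediction at a single location for several values of \(\gamma\). The helper idw_predict and the three toy observations are hypothetical; this is not the idw function of spatstat, only a direct transcription of the two expressions above.

```r
# Sketch of IDW at a single target location p.
# `idw_predict()` is a hypothetical helper written for this example.
idw_predict <- function(u, v, z, u_p, v_p, gamma = 1) {
  # Euclidean distances from p to every observation
  d <- sqrt((u - u_p)^2 + (v - v_p)^2)
  # If p coincides with an observation, return the observed value
  if (any(d == 0)) return(mean(z[d == 0]))
  # Weights are the inverse of distance raised to the power gamma
  w <- 1 / d^gamma
  sum(w * z) / sum(w)
}

# Toy observations (made up for the example)
u <- c(0, 3, 0)
v <- c(0, 0, 4)
z <- c(10, 20, 40)

# gamma = 0 gives equal weights, i.e., the global mean of z
idw_predict(u, v, z, u_p = 1, v_p = 0, gamma = 0)
# A moderate decay gives more weight to the closer observations
idw_predict(u, v, z, u_p = 1, v_p = 0, gamma = 1)
# A large gamma approaches the value of the nearest observation (10)
idw_predict(u, v, z, u_p = 1, v_p = 0, gamma = 20)
```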

We can see that inverse distance weighting is a weighted average of all observations in the sample, but with greater weight given to more proximate observations. This approach is implemented in R in the package spatstat with the function idw. To use this function, the points must be converted into a ppp object. This necessitates that we define a window object, which we do based on the bounding box that we created for the Voronoi polygons:

# Function `as.owin()` takes the polygon with the bounding box
# we created above and converts it into an `owin` object for 
# use with the `spatstat` package. 
W.owin <- as.owin(W.bbox)

# We can create a `ppp` object with the coordinates of the points
Walker_Lake.ppp <- as.ppp(X = Walker_Lake[,2:4], W = W.owin)

The call to the function requires a ppp object and the argument for the power to use in the inverse distance function. In this call, the power is set to 1:

z_p.idw1 <- idw(Walker_Lake.ppp, power = 1)

The value (output) of this function is an im object. Objects of this type are used by the package spatstat to work with raster data. It can be simply plotted as follows:

plot(z_p.idw1)

Or the information can be extracted for greater control of the aspect of the plot in ggplot2:

data.frame(expand.grid(X= z_p.idw1$xcol,
                       Y = z_p.idw1$yrow),
           # transpose matrix
           V = as.vector(t(z_p.idw1$v))) %>%
  ggplot(aes(x = X,
             y = Y, 
             fill = V)) + 
  geom_tile() +
  scale_fill_distiller(palette = "OrRd", 
                       direction = 1) +
  coord_equal()
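The transpose in the code above matters because of how the pixel values are stored: in an im object the rows of the matrix v correspond to yrow and the columns to xcol, whereas expand.grid varies its first argument fastest. The sketch below checks this alignment with a toy matrix (the values are made up so that each pixel equals the sum of its coordinates).

```r
# Toy "image": rows index y and columns index x, as in the `v` matrix
# of an `im` object
xcol <- c(1, 2, 3)
yrow <- c(10, 20)
# The value at each pixel is x + y, so the alignment can be checked
v <- outer(yrow, xcol, FUN = "+")

grid <- expand.grid(X = xcol, Y = yrow)
# `expand.grid()` varies X fastest; `as.vector()` reads a matrix by columns
# (varying the row, i.e., y, fastest), hence the transpose
grid$V <- as.vector(t(v))
grid$V_wrong <- as.vector(v)

all(grid$V == grid$X + grid$Y)        # TRUE: values match their coordinates
all(grid$V_wrong == grid$X + grid$Y)  # FALSE: values are misaligned
```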

Notice the dots where the observations are - the value of the field is known there. We can explore the effect of changing the parameter for the power, by using \(\gamma = 0.5, 1, 2, \text{ and } 5\):

# Inverse distance weighting for Walker Lake 
# using three additional values of the power (gamma)

z_p.idw05 <- idw(Walker_Lake.ppp, power = 0.5)
z_p.idw2 <- idw(Walker_Lake.ppp, power = 2)
z_p.idw5 <- idw(Walker_Lake.ppp, power = 5)

For ease of comparison, we will collect the information into a single data frame:

z_p.idw05.df <- data.frame(expand.grid(X = z_p.idw05$xcol, 
                                       Y = z_p.idw05$yrow),
                           V = as.vector(t(z_p.idw05$v)),
                           Power = "P05")
z_p.idw1.df <- data.frame(expand.grid(X= z_p.idw1$xcol, 
                                      Y = z_p.idw1$yrow),
                          V = as.vector(t(z_p.idw1$v)),
                          Power = "P1")
z_p.idw2.df <- data.frame(expand.grid(X= z_p.idw2$xcol, 
                                      Y = z_p.idw2$yrow),
                          V = as.vector(t(z_p.idw2$v)), 
                          Power = "P2")
z_p.idw5.df <- data.frame(expand.grid(X= z_p.idw5$xcol, 
                                      Y = z_p.idw5$yrow),
                          V = as.vector(t(z_p.idw5$v)), 
                          Power = "P5")

# Bind the data frames
idw_df <- rbind(z_p.idw05.df, 
                z_p.idw1.df, 
                z_p.idw2.df, 
                z_p.idw5.df)

We can now plot using the facet_wrap function to compare the results side by side:

ggplot(data = idw_df, 
       aes(x = X, 
           y = Y, 
           fill = V)) + 
  geom_tile() +
  scale_fill_distiller(palette = "OrRd", 
                       direction = 1) +
  coord_equal() + 
  facet_wrap(~ Power,
             ncol = 2)

Notice how smaller values of \(\gamma\) “flatten” the predictions, in the extreme tending towards the global average, as all observations are weighted equally. Larger values, on the other hand, tend to be the average of a single point, the closest one. In fact, this replicates the Voronoi polygons, as seen in the following plot that combines the Voronoi polygons (without filling!) and the predictions from the IDW algorithm with \(\gamma = 5\):

ggplot() + 
  geom_tile(data = subset(idw_df, 
                          Power == "P5"), 
            aes(x = X, 
                y = Y, 
                fill = V)) +
  geom_sf(data = Walker_Lake.v, 
          color =  "white", 
          fill = NA) +
  scale_fill_distiller(palette = "OrRd", 
                       direction =  1)

Clearly, selection of a value for \(\gamma\) is an important modeling decision when using IDW.

31.8 \(k\)-point means

Another interpolation technique that is based on the idea of moving averages is \(k\)-point means. Again, this will look familiar to you, because it is also formally identical to the spatial moving average: \[ \hat{z}_p = \frac{\sum_{i=1}^n{w_{pi}z_i}}{\sum_{i=1}^n{w_{pi}}} \]

The spatial weights in this case, however, are defined in terms of \(k\)-nearest neighbors: \[ w_{pi} = \bigg\{\begin{array}{ll} 1 & \text{if } i \text{ is one of the } k \text{ nearest neighbors of } p \text{ for a given }k \\ 0 & \text{otherwise} \\ \end{array} \]

Clearly, if row-standardized spatial weights \(w_{pi}^{st}\) are used, the above becomes: \[ \hat{z}_p = \sum_{i=1}^n {w_{pi}^{st}z_i} \]
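The row-standardized form can be sketched in base R as follows. The helper kpoint_predict and the toy observations are hypothetical and written only to mirror the formulas; this is not the kpointmean function used later in the chapter.

```r
# Sketch of k-point means at a single target location p.
# `kpoint_predict()` is a hypothetical helper written for this example.
kpoint_predict <- function(u, v, z, u_p, v_p, k = 3) {
  # Euclidean distances from p to every observation
  d <- sqrt((u - u_p)^2 + (v - v_p)^2)
  # Binary weights: 1 for the k nearest observations, 0 otherwise
  w <- as.numeric(rank(d, ties.method = "first") <= k)
  # Row-standardized weights add up to 1
  w_st <- w / sum(w)
  sum(w_st * z)
}

# Toy observations (made up for the example)
u <- c(0, 1, 2, 5)
v <- c(0, 0, 0, 5)
z <- c(10, 20, 30, 100)

# k = 1 returns the value of the nearest observation
kpoint_predict(u, v, z, u_p = 0.9, v_p = 0, k = 1)
# k = 2 averages the two nearest observations
kpoint_predict(u, v, z, u_p = 0.9, v_p = 0, k = 2)
# k equal to the sample size returns the global mean
kpoint_predict(u, v, z, u_p = 0.9, v_p = 0, k = 4)
```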

We can calculate \(k\)-point means using the example. For this, we need to define a set of “target” coordinates, that is, the points where we wish to interpolate, which we also convert to simple features:

# Create a fine grid for prediction, i.e., our "target"
# points
target_xy <- expand.grid(x = seq(0.5, 
                                259.5, 
                                2.2), 
                        y = seq(0.5, 
                                299.5, 
                                2.2)) %>%
  st_as_sf(coords = c("x", "y"))

The inputs to the function kpointmean are a simple features object with the variable that we wish to interpolate, a simple features object with the points to which we wish to interpolate, the name of the variable to interpolate (which must be in the source table), and the number of points to use for interpolation (see below). The value (output) of the function is a simple features table with the target points, as well as estimated values of \(\hat{z}_p\) at those points. Using the three nearest neighbors:

# Use the source and target points to interpolate
kpoint.3 <- kpointmean(source_xy = Walker_Lake.sf, 
                       target_xy = target_xy,
                       z = V,
                       k = 3) %>% 
  # Rename the columns to match the names of columns in
  # our other data frame
  rename(V = z)
## projected points

We can plot the interpolated field now:

ggplot() +
  geom_sf(data = kpoint.3, 
            aes(color = V)) +
  scale_color_distiller(palette = "OrRd", 
                       direction = 1)

As with other spatially moving averages, the crucial aspect of implementing \(k\)-point means is the selection of \(k\). A large value will tend towards the global average, whereas a value of 1 will tend to replicate the Voronoi polygons (see below):

# Calculate k-point means using only one point
kpoint.1 <- kpointmean(source_xy = Walker_Lake.sf, 
                       target_xy = target_xy,
                       z = V,
                       k = 1) %>% 
  # Rename the columns to match the names of columns in
  # our other data frame
  rename(V = z)
## projected points

This is the plot with the Voronoi polygons:

# Plot and overlay the Voronoi polygons
ggplot() + 
  geom_sf(data = kpoint.1, 
          aes(color = V)) +
  geom_sf(data = Walker_Lake.v, 
          color =  "white", fill = NA) +
  scale_color_distiller(palette = "OrRd", 
                       direction =  1)

This shows that Voronoi polygons can be seen as a special case of IDW or \(k\)-point means depending on the way these two techniques are implemented.

32 Activity 15: Spatially Continuous Data I

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

32.1 Practice questions

Answer the following questions:

  1. What is the difference between spatially continuous data and a spatial point pattern?
  2. What is the purpose of spatial interpolation?
  3. In your own words describe the method of Inverse Distance Weighting.
  4. Consider the following spatial interpolation algorithms: Voronoi polygons and k-point means. How do they differ when the number of points used to calculate means is 1?

32.2 Learning objectives

In this activity, you will:

  1. Explore a dataset with spatially continuous data using visualization as appropriate.
  2. Discuss a process that might explain any pattern observed from the data.
  3. Conduct a modeling exercise using appropriate techniques. Justify your modeling decisions.

32.3 Suggested reading

  • Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapters 5 and 6. Longman: Essex.
  • Bivand RS, Pebesma E, and Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 8. Springer: New York.
  • Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 6, Sections 6.7 and 6.8. Sage: Los Angeles.
  • Isaaks EH and Srivastava RM (1989) An Introduction to Applied Geostatistics, Chapter 4. Oxford University Press: Oxford.
  • O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapters 9 and 10. John Wiley & Sons: New Jersey.

32.4 Preliminaries

It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently on the workspace.

Load the libraries you will use in this activity (load other packages as appropriate).

library(isdas)
library(tidyverse)
library(spatstat)
library(spdep)

Load the data that you will use in this activity:

data("aquifer")

The data is a set of piezometric head (watertable pressure) observations of the Wolfcamp Aquifer in Texas (https://en.wikipedia.org/wiki/Hydraulic_head). Measures of pressure can be used to infer the flow of underground water, since water flows from high to low pressure areas.

These data were collected to evaluate potential flow of contamination related to a high level toxic waste repository in Texas. The Deaf Smith county site in Texas was one of three potential sites proposed for this repository. Beneath the site is a deep brine aquifer known as the Wolfcamp aquifer that may serve as a conduit of contamination leaking from the repository.

The data set consists of 85 georeferenced measurements of piezometric head. Possible applications of interpolation are to determine sites at risk and to quantify uncertainty of the interpolated surface, to evaluate the best locations for monitoring stations.

Create a unique identifier variable:

aquifer$ID <- factor(c(1:nrow(aquifer)))

32.5 Activity

NOTE: Activities include technical “how to” tasks/questions. Usually, these ask you to practice using the software to organize data, create plots, and so on in support of analysis and interpretation. The second type of questions ask you to activate your brainware and to think geographically and statistically.

Activity Part I

  1. Map the Wolfcamp Aquifer data.

  2. Create a surface using Voronoi polygons.

  3. Create a surface using IDW.

  4. Create a surface using \(k\)-point means.

Activity Part II

  1. What is the effect of changing the power of the inverse distance function?

  2. What is the effect of changing the number of points used in \(k\)-point means?

  3. Discuss the limitations of these approaches. How can you calculate the uncertainty in the predictions?

33 Spatially Continuous Data II

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

Previously, you were introduced to spatially continuous data, also called fields. In this practice, you will continue to work with this type of spatial data.

33.1 Learning objectives

In the previous practice you were introduced to the concept of fields/spatially continuous data. Three different approaches were discussed that can be used to convert a set of observations of a field at discrete locations into a surface, namely tile-based approaches, inverse distance weighting (IDW), and \(k\)-point means. In this practice, you will learn:

  1. About intervals of confidence for predictions.
  2. How to use trend surface analysis as an interpolation tool.
  3. About the difference between accuracy and precision in interpolation.

33.2 Suggested readings

  • Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapters 5 and 6. Longman: Essex.
  • Bivand RS, Pebesma E, and Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 8. Springer: New York.
  • Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 6, Sections 6.7 and 6.8. Sage: Los Angeles.
  • Isaaks EH and Srivastava RM (1989) An Introduction to Applied Geostatistics, Chapter 4. Oxford University Press: Oxford.
  • O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapters 9 and 10. John Wiley & Sons: New Jersey.

33.3 Preliminaries

As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently on the workspace.

Load the libraries you will use in this activity:

library(isdas)
library(plotly)
library(spatstat)
library(spdep)
library(tidyverse)

Begin by loading the data file that we will use in this chapter:

data("Walker_Lake")

You can verify the contents of the dataframe:

summary(Walker_Lake)
##       ID                  X               Y               V                U           T      
##  Length:470         Min.   :  8.0   Min.   :  8.0   Min.   :   0.0   Min.   :   0.00   1: 45  
##  Class :character   1st Qu.: 51.0   1st Qu.: 80.0   1st Qu.: 182.0   1st Qu.:  83.95   2:425  
##  Mode  :character   Median : 89.0   Median :139.5   Median : 425.2   Median : 335.00          
##                     Mean   :111.1   Mean   :141.3   Mean   : 435.4   Mean   : 613.27          
##                     3rd Qu.:170.0   3rd Qu.:208.0   3rd Qu.: 644.4   3rd Qu.: 883.20          
##                     Max.   :251.0   Max.   :291.0   Max.   :1528.1   Max.   :5190.10          
##                                                                      NA's   :195

We have already met this data set before: it contains a set of coordinates X and Y (units are meters; the origin is false), as well as two quantitative variables V and U (notice that there are missing observations in U), and a factor T.

33.4 Uncertainty in the predictions

A common task in the analysis of spatially continuous data is to estimate the value of a variable at a location where it was not measured - or in other words, to spatially interpolate the variable. In the chapter Spatially Continuous Data I, we introduced three methods for spatial interpolation based on a sample of observations.

The three algorithms that we saw before (i.e., Voronoi polygons, IDW, and \(k\)-point means) accomplish the task of providing spatial estimates. The values that we obtain with these methods are called point estimates. What is a point estimate? Recall the definition of a field that is the outcome of a purely spatial process: \[ z_i = f(u_i, v_i) + \epsilon_i \]

Accordingly, the prediction of the field at a new location is defined as a function of the estimated process and some random residual as follows: \[ \hat{z}_p = \hat{f}(u_p, v_p) + \hat{\epsilon}_p \] The first part of the prediction (\(\hat{f}(u_p, v_p)\)) is the point estimate of the prediction, whereas the second part (\(\hat{\epsilon}_p\)) is the random part of the estimate.

The methods we saw in the chapter Spatially Continuous Data I can be used to obtain point estimates of the process. Unfortunately, they do not provide an estimate for the random element, so it is not possible to assess the uncertainty of the estimated values directly, since this depends on the distribution of the random term.

There are different ways in which at least some crude assessment of uncertainty can be attached to point estimates obtained from Voronoi polygons, IDW, or \(k\)-point means. For example, a very simple approach could be to use the sample variance to calculate intervals of confidence. This could be done as follows.

We know that the sample variance describes the inherent variability in the distribution of values of a variable in a sample. Consider for instance the distribution of variable V in the Walker Lake dataset:

ggplot(data = Walker_Lake, aes(V)) + 
  geom_histogram(binwidth = 60)

Clearly, there are no values of the variable less than zero, and values in excess of 1,000 are rare.

The standard deviation of the sample is:

sd(Walker_Lake$V)
## [1] 301.1554

The standard deviation measures the typical deviation from the mean (more precisely, it is the square root of the average squared deviation). We could use this value to say that typical deviations from our point estimates are a function of this standard deviation (to what extent depends on the underlying distribution).

A problem with using this approach is that the distribution of the variable is not normal, and the distribution of \(\hat{\epsilon}_p\) is unknown; the standard deviation is centered on the mean (meaning that it is a poor estimate for observations away from the mean); and in any case the standard deviation of the sample is too large for local point estimates if there is spatial pattern (since we know that the local mean will vary systematically).
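To see one of these problems concretely, the following sketch builds a crude interval of two standard deviations around the sample mean of V. It uses the mean reported by `summary(Walker_Lake)` (435.4) and the standard deviation computed above. The choice of two standard deviations is an assumption made only for illustration: it would correspond to roughly 95% coverage only if the variable were normal, which it is not.

```r
# Values copied from the outputs shown in the text (not recomputed here)
v_mean <- 435.4     # sample mean of V, from summary(Walker_Lake)
v_sd   <- 301.1554  # sample standard deviation, from sd(Walker_Lake$V)

# Crude interval: mean plus/minus two standard deviations (illustrative choice)
crude_interval <- c(lower = v_mean - 2 * v_sd,
                    upper = v_mean + 2 * v_sd)
crude_interval
```

The lower bound of this interval is negative, even though the histogram shows that the variable takes no values below zero. The same wide interval would also be attached to every location, regardless of the local values of the field.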

There are other approaches to deal with non-normal variables, for instance the Wilcoxon signed rank test, but some of the other limitations remain.

wilcox.test(Walker_Lake$V, conf.int = TRUE, conf.level = 0.95)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  Walker_Lake$V
## V = 100576, p-value < 2.2e-16
## alternative hypothesis: true location is not equal to 0
## 95 percent confidence interval:
##  418.7 475.6
## sample estimates:
## (pseudo)median 
##          447.2

As an alternative, the local standard deviation could be used.

Consider the case of \(k\)-point means. The point estimate is based on the values of the \(k\)-nearest neighbors: \[ \hat{z}_p = \frac{\sum_{i=1}^n{w_{pi}z_i}}{\sum_{i=1}^n{w_{pi}}} \]

With: \[ w_{pi} = \begin{cases} 1 & \text{if } i \text{ is one of the } k \text{ nearest neighbors of } p \\ 0 & \text{otherwise} \end{cases} \]

The standard deviation could also be calculated based on the values of the \(k\)-nearest neighbors, meaning that it would be based on the local mean. Here, we will interpolate the field using the Walker Lake data. First create a target grid for interpolation, and extract the coordinates of observations:

# Create a prediction grid and convert to simple features:
target_xy = expand.grid(x = seq(0.5, 259.5, 2.2), 
                        y = seq(0.5, 299.5, 2.2)) %>%
  st_as_sf(coords = c("x", "y"))

# Convert the `Walker_Lake` dataframe to a simple features object as follows:
Walker_Lake.sf <- Walker_Lake %>% 
  st_as_sf(coords = c("X", "Y"))

Interpolation using \(k=5\) neighbors:

kpoint.5 <- kpointmean(source_xy = Walker_Lake.sf, 
                       target_xy = target_xy, 
                       z = V,  
                       k = 5) %>%
  rename(V = z)
## projected points

We can plot the interpolated field now. These are the interpolated values:

ggplot() +
  geom_sf(data = kpoint.5, 
          aes(color = V)) +
  scale_color_distiller(palette = "OrRd", 
                       direction = 1)

In addition, we can plot the local standard deviation:

ggplot() +
  geom_sf(data = kpoint.5, 
          aes(color = sd)) +
  scale_color_distiller(palette = "OrRd", 
                       direction = 1)

The local standard deviation indicates the typical deviation from the local mean. As expected, the local values of the standard deviation are usually lower than the standard deviation of the sample, and they tend to be larger in the tails, that is, at the locations where the values are rare: there we have less information, hence greater uncertainty.
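The calculation behind a local mean and a local standard deviation can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the data are simulated (not Walker Lake), and `local_mean_sd()` is a hypothetical helper written for this example, not the implementation used by `kpointmean()`.

```r
set.seed(123)

# Simulated observations (hypothetical data): coordinates and a variable z
obs <- data.frame(x = runif(50), y = runif(50))
obs$z <- 10 + 5 * obs$x + rnorm(50, sd = 1)

# Hypothetical helper: mean and standard deviation of z over the
# k observations that are nearest to the prediction location (px, py)
local_mean_sd <- function(px, py, obs, k = 5) {
  d <- sqrt((obs$x - px)^2 + (obs$y - py)^2)  # distances to all observations
  nn <- order(d)[seq_len(k)]                  # indices of the k nearest
  c(mean = mean(obs$z[nn]), sd = sd(obs$z[nn]))
}

# Local estimates at the center of the region, and the global sd for comparison
local_est <- local_mean_sd(px = 0.5, py = 0.5, obs = obs, k = 5)
local_est
sd(obs$z)
```

Because the local standard deviation is computed from only \(k\) values, it is itself a noisy quantity, which is one more reason to treat it as a crude indicator.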

The local standard deviation is a crude estimator of the uncertainty because we do not know the underlying distribution. Other approaches based on bootstrapping (randomly sampling from the observed values of the variable) could be implemented, but they are beyond the scope of the present discussion.

The issue of assessing the level of uncertainty in the predictions with Voronoi polygons, IDW, and \(k\)-point means reflects the fact that these methods were not designed to deal explicitly with the random nature of predicting fields. Other methods deal with this issue more naturally. We will revisit two estimation methods that we covered before, and see how they can be applied to spatial interpolation.

33.5 Trend surface analysis

Trend surface analysis is a form of multivariate regression that uses the coordinates of the observations to fit a surface to the data.

We can illustrate this technique by means of a simulated example. We will begin by simulating a set of observations, starting with their coordinates in the unit square:

# `n` is the number of observations to simulate
n <- 180

# Here we create a dataframe with these values: `u` and `v` will be the coordinates of our process
df <- data.frame(u = runif(n = n, min = 0, max = 1), 
                 v = runif(n = n, min = 0, max = 1))

Once we have simulated the coordinates for the example, we can plot their locations:

ggplot(data = df, aes(x = u, y = v)) + 
  geom_point() + 
  coord_equal()

We can now proceed to simulate a spatial process as follows:

# Use `mutate()` to create a new stochastic variable `z` that is a function of the coordinates and a random normal variable that we create with `rnorm()`; this random variable has a mean of zero and a standard deviation of 0.1.
df <- mutate(df, z = 0.5 + 0.3 * u + 0.7 * v + rnorm(n = n, mean = 0, sd = 0.1))

A 3D scatterplot can be useful to explore the data:

# Create a 3D scatterplot with the function `plot_ly()`. Notice that the way this function works is similar to `ggplot2`: the arguments are a dataframe, what should be plotted on the x-axis, the y-axis, the z-axis, and other aesthetics (aspects) of the plot. Here the color will be proportional to the values of `z`. The function `add_markers()` is similar to the family of `geom_` functions in `ggplot2`, but more general, since it will try to guess what you are trying to plot based on the inputs (in this case points). The function `layout()` is used to control other parts of the plot: here the `aspectratio` is selected so that the scale is identical for all three axes.
plot_ly(data = df, x = ~u, y = ~v, z = ~z, color = ~z) %>% 
  add_markers() %>% 
  layout(scene = list(
    aspectmode = "manual", aspectratio = list(x=1, y=1, z=1)))

We can fit a trend surface to the data as follows. This is a regression model that uses the coordinates of the observations as covariates. In this case, the trend is linear:

trend.l <- lm(formula = z ~ u + v, data = df)
summary(trend.l)
## 
## Call:
## lm(formula = z ~ u + v, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.23764 -0.06859 -0.01136  0.06431  0.26406 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.49315    0.01869   26.39   <2e-16 ***
## u            0.30348    0.02481   12.23   <2e-16 ***
## v            0.69825    0.02432   28.71   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09765 on 177 degrees of freedom
## Multiple R-squared:  0.8501, Adjusted R-squared:  0.8484 
## F-statistic:   502 on 2 and 177 DF,  p-value: < 2.2e-16

Given a trend surface model, we can estimate the value of the variable \(z\) at locations where it was not measured. Typically this is done by interpolating on a fine grid that can be used for visualization or further analysis, as shown next.

We will begin by creating a grid for interpolation. We will call the coordinates u.p and v.p. We generate these by creating a sequence of values in the domain of the data, for instance in the [0,1] interval:

u.p <- seq(from = 0.0, to = 1.0, by = 0.05)
v.p <- seq(from = 0.0, to = 1.0, by = 0.05)

For prediction, we want all combinations of u.p and v.p, so we expand these two vectors into a grid, by means of the function expand.grid():

# The function `expand.grid()` creates a grid with all the combination of values of the inputs.
df.p <- expand.grid(u = u.p, v = v.p)

Notice that while u.p and v.p are vectors of size 21, the dataframe df.p contains 21 × 21 = 441 observations, that is, all the combinations of u.p and v.p.

Once we have the coordinates for interpolation, the predict() function can be used in conjunction with the results of the estimation. When invoking the function, we indicate that we also wish to obtain the standard errors of the fitted values (se.fit = TRUE) and the interval of the predictions at a 95% level of confidence:

preds <- predict(trend.l, newdata = df.p, se.fit = TRUE, interval = "prediction", level = 0.95)

The interval of confidence of the predictions at the 95% level of confidence is given in the form of the lower (lwr) and upper (upr) bounds:

summary(preds$fit)
##       fit              lwr              upr        
##  Min.   :0.4932   Min.   :0.2969   Min.   :0.6894  
##  1st Qu.:0.8103   1st Qu.:0.6161   1st Qu.:1.0045  
##  Median :0.9940   Median :0.8008   Median :1.1873  
##  Mean   :0.9940   Mean   :0.7997   Mean   :1.1884  
##  3rd Qu.:1.1777   3rd Qu.:0.9834   3rd Qu.:1.3720  
##  Max.   :1.4949   Max.   :1.2988   Max.   :1.6910

These values indicate that the predictions of \(z_p\) are, with 95% confidence, in the following interval: \[ CI_{z_p} = [z_{p_{lwr}}, z_{p_{upr}}]. \]
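The argument `interval` of `predict()` accepts both `"confidence"` and `"prediction"`, and the two are easily confused. The following sketch, which simulates its own data in the same way as the example (the seed and the names `sim`, `fit_sim` and `new_xy` are assumptions of this illustration), compares the widths of the two intervals at a few locations.

```r
set.seed(10)

# Simulate a linear trend surface, as in the example in the text
n_sim <- 180
sim <- data.frame(u = runif(n_sim), v = runif(n_sim))
sim$z <- 0.5 + 0.3 * sim$u + 0.7 * sim$v + rnorm(n_sim, mean = 0, sd = 0.1)
fit_sim <- lm(z ~ u + v, data = sim)

# A few prediction locations
new_xy <- data.frame(u = c(0.1, 0.5, 0.9), v = c(0.1, 0.5, 0.9))

# Interval for the mean of the surface versus interval for a new observation
int_conf <- predict(fit_sim, newdata = new_xy, interval = "confidence", level = 0.95)
int_pred <- predict(fit_sim, newdata = new_xy, interval = "prediction", level = 0.95)

width_conf <- int_conf[, "upr"] - int_conf[, "lwr"]
width_pred <- int_pred[, "upr"] - int_pred[, "lwr"]
cbind(width_conf, width_pred)
```

The point estimates are identical in both cases. The prediction interval is always the wider of the two, because it accounts for the variability of the random term in addition to the uncertainty about the estimated surface.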

A convenient way to visualize the results of the analysis above, that is, to inspect the trend surface and the interval of confidence of the predictions, is by means of a 3D plot as follows.

First create matrices with the point estimates of the trend surface (z.p), and the lower and upper bounds (z.p_l, z.p_u):

z.p <- matrix(data = preds$fit[,1], nrow = 21, ncol = 21, byrow = TRUE)
z.p_l <- matrix(data = preds$fit[,2], nrow = 21, ncol = 21, byrow = TRUE)
z.p_u <- matrix(data = preds$fit[,3], nrow = 21, ncol = 21, byrow = TRUE)
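The reason for using `byrow = TRUE` may not be obvious. `expand.grid()` varies its first argument fastest, so the predictions are ordered with `u` changing first. Filling the matrix by rows therefore places the values of `v` along the rows and the values of `u` along the columns. The small sketch below checks this on a deliberately non-square grid; the grid and the test function `u + 10 * v` are assumptions chosen so that the orientation is easy to read.

```r
# Small grid with 3 values of u and 5 values of v
u_small <- seq(from = 0, to = 1, by = 0.5)
v_small <- seq(from = 0, to = 1, by = 0.25)
grid_small <- expand.grid(u = u_small, v = v_small)

# A function of the coordinates that makes rows and columns easy to identify
vals <- grid_small$u + 10 * grid_small$v

# Rows index v and columns index u when the matrix is filled by rows
m <- matrix(data = vals, nrow = length(v_small), ncol = length(u_small), byrow = TRUE)
m

# Element [i, j] corresponds to v_small[i] and u_small[j]
all.equal(m, outer(10 * v_small, u_small, FUN = "+"))
```

With a square 21 by 21 grid a mistake in the orientation would not raise an error, only a transposed surface, so a check of this kind can be useful.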

The plot is created using the coordinates used for interpolation (u.p and v.p) and the matrices with the point estimates z.p and the upper and lower bounds. The type of plot in the package plotly is a surface:

trend.plot <- plot_ly(x = ~u.p, y = ~v.p, z = ~z.p, 
        type = "surface", colors = "YlOrRd") %>% 
  add_surface(x = ~u.p, y = ~v.p, z = ~z.p_l, 
              opacity = 0.5, showscale = FALSE) %>%
  add_surface(x = ~u.p, y = ~v.p, z = ~z.p_u, 
              opacity = 0.5, showscale = FALSE) %>% 
  layout(scene = list(
    aspectmode = "manual", aspectratio = list(x = 1, y = 1, z = 1)))

trend.plot

In this way, we have not only an estimate of the underlying field, but also a measure of uncertainty for our predictions, since our estimated values are bounded, with 95% confidence, between the lower and upper surfaces.

It is important to note that, although the confidence interval provides a measure of uncertainty, it does not provide an estimate of the prediction error \(\hat{\epsilon}_p\). This quantity cannot be calculated directly, because we do not know the true value of the field at location \(p\). We will revisit this point later.
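In a simulated example, unlike with real data, the true process is known, so we can generate additional observations and compare them with the predictions. The sketch below does this under the same assumptions as the simulation in the text (the helper `simulate_field()`, the seed, and the size of the hold-out sample are choices made for this illustration only).

```r
set.seed(2021)

# Hypothetical helper: draw n observations from the simulated linear process
simulate_field <- function(n) {
  d <- data.frame(u = runif(n), v = runif(n))
  d$z <- 0.5 + 0.3 * d$u + 0.7 * d$v + rnorm(n, mean = 0, sd = 0.1)
  d
}

train <- simulate_field(180)     # sample used to estimate the trend surface
holdout <- simulate_field(1000)  # new locations where the truth is known

fit_train <- lm(z ~ u + v, data = train)
pred <- predict(fit_train, newdata = holdout, interval = "prediction", level = 0.95)

# Prediction errors are observable here only because the data are simulated
errors <- holdout$z - pred[, "fit"]
summary(errors)

# Proportion of hold-out values that fall inside the 95% prediction interval
coverage <- mean(holdout$z >= pred[, "lwr"] & holdout$z <= pred[, "upr"])
coverage
```

When the model is correctly specified, as it is here by construction, the coverage should be close to the nominal 95%. With real data no such check against the truth is available at unsampled locations.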

For the time being, we will apply trend surface analysis to the Walker Lake dataset.

We will first calculate the polynomial terms of the coordinates, for instance to the 3rd degree (this can be done to any arbitrary degree, however keeping in mind the caveats discussed previously with respect to trend surface analysis):

Walker_Lake <- mutate(Walker_Lake,
                        X3 = X^3, X2Y = X^2 * Y, X2 = X^2, 
                        XY = X * Y,
                        Y2 = Y^2, XY2 = X * Y^2, Y3 = Y^3)
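As an aside, the same polynomial terms can be generated inside the model formula with the base R function `poly()` and the argument `raw = TRUE`. The sketch below uses simulated data (an assumption of this example, so that it does not depend on the Walker Lake dataframe) to check that both ways of specifying a cubic trend surface produce the same fitted values.

```r
set.seed(1)

# Simulated coordinates and variable (hypothetical data)
sim_xy <- data.frame(X = runif(100, min = 0, max = 10),
                     Y = runif(100, min = 0, max = 10))
sim_xy$V <- 2 + 0.5 * sim_xy$X - 0.1 * sim_xy$X * sim_xy$Y +
  0.01 * sim_xy$Y^3 + rnorm(100)

# Explicit polynomial terms, mirroring the columns created in the text
sim_xy <- transform(sim_xy,
                    X3 = X^3, X2Y = X^2 * Y, X2 = X^2,
                    XY = X * Y,
                    Y2 = Y^2, XY2 = X * Y^2, Y3 = Y^3)
fit_explicit <- lm(V ~ X3 + X2Y + X2 + X + XY + Y + Y2 + XY2 + Y3, data = sim_xy)

# The same cubic surface specified with raw polynomials in the formula
fit_poly <- lm(V ~ poly(X, Y, degree = 3, raw = TRUE), data = sim_xy)

# Both specifications span the same terms, so the fitted values should agree
all.equal(unname(fitted(fit_explicit)), unname(fitted(fit_poly)))
```

An advantage of the formula version is that `predict()` builds the polynomial terms for new coordinates automatically. The explicit columns, on the other hand, make the individual terms easier to read in the model summary.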

We can proceed to estimate the following models.

Linear trend surface model:

WL.trend1 <- lm(formula = V ~ X + Y, data = Walker_Lake)
summary(WL.trend1)
## 
## Call:
## lm(formula = V ~ X + Y, data = Walker_Lake)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -576.05 -241.79   -4.77  201.48 1055.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 589.5289    34.5547  17.061  < 2e-16 ***
## X            -1.0082     0.1903  -5.297 1.82e-07 ***
## Y            -0.2980     0.1727  -1.726   0.0851 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 292.1 on 467 degrees of freedom
## Multiple R-squared:  0.06338,    Adjusted R-squared:  0.05937 
## F-statistic:  15.8 on 2 and 467 DF,  p-value: 2.29e-07

Quadratic trend surface model:

WL.trend2 <- lm(formula = V ~ X2 + X + XY + Y + Y2, data = Walker_Lake)
summary(WL.trend2)
## 
## Call:
## lm(formula = V ~ X2 + X + XY + Y + Y2, data = Walker_Lake)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -579.37 -220.93    3.96  200.66  997.30 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 320.484250  70.904832   4.520 7.86e-06 ***
## X2            0.001145   0.003191   0.359  0.71994    
## X            -0.325993   0.910872  -0.358  0.72059    
## XY           -0.006281   0.002244  -2.799  0.00534 ** 
## Y             3.737955   0.722097   5.177 3.37e-07 ***
## Y2           -0.011409   0.002188  -5.215 2.77e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 283.1 on 464 degrees of freedom
## Multiple R-squared:  0.1258, Adjusted R-squared:  0.1164 
## F-statistic: 13.35 on 5 and 464 DF,  p-value: 3.571e-12

Cubic trend surface model:

WL.trend3 <- lm(formula = V ~ X3 + X2Y + X2 + X + XY + Y + Y2 + XY2 + Y3, 
                data = Walker_Lake)
summary(WL.trend3)
## 
## Call:
## lm(formula = V ~ X3 + X2Y + X2 + X + XY + Y + Y2 + XY2 + Y3, 
##     data = Walker_Lake)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -564.19 -197.41    7.91  194.25  929.72 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.620e+00  1.227e+02  -0.070 0.944035    
## X3           1.533e-04  4.806e-05   3.190 0.001522 ** 
## X2Y          6.139e-05  3.909e-05   1.570 0.117000    
## X2          -6.651e-02  1.838e-02  -3.618 0.000330 ***
## X            9.172e+00  2.386e+00   3.844 0.000138 ***
## XY          -4.420e-02  1.430e-02  -3.092 0.002110 ** 
## Y            4.794e+00  2.040e+00   2.350 0.019220 *  
## Y2          -1.806e-03  1.327e-02  -0.136 0.891822    
## XY2          7.679e-05  2.956e-05   2.598 0.009669 ** 
## Y3          -4.170e-05  2.819e-05  -1.479 0.139759    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.7 on 460 degrees of freedom
## Multiple R-squared:  0.1719, Adjusted R-squared:  0.1557 
## F-statistic: 10.61 on 9 and 460 DF,  p-value: 5.381e-15

Inspection of the results of the three models above suggests that the cubic trend surface provides the best fit, with the highest adjusted coefficient of determination, even if the value is relatively low at approximately 0.16. Also, the cubic trend yields the smallest residual standard error (276.7, compared to 292.1 and 283.1), which implies that the intervals of confidence are tighter, and hence the degree of uncertainty is smaller.
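Rather than reading these quantities off each summary, they can be collected in a small table. The following sketch shows one way of doing this; it uses simulated data and model names that are assumptions of the example, and writes the polynomial terms with `I()` so that no additional columns are needed.

```r
set.seed(7)

# Simulated coordinates and variable (hypothetical data)
sim_fit <- data.frame(X = runif(200, min = 0, max = 10),
                      Y = runif(200, min = 0, max = 10))
sim_fit$V <- 5 + 0.8 * sim_fit$X - 0.05 * sim_fit$Y^2 + rnorm(200, sd = 2)

# Linear, quadratic, and cubic trend surfaces
models <- list(
  linear    = lm(V ~ X + Y, data = sim_fit),
  quadratic = lm(V ~ X + Y + I(X^2) + I(X * Y) + I(Y^2), data = sim_fit),
  cubic     = lm(V ~ X + Y + I(X^2) + I(X * Y) + I(Y^2) +
                   I(X^3) + I(X^2 * Y) + I(X * Y^2) + I(Y^3), data = sim_fit)
)

# Adjusted R-squared and residual standard error of each model
fit_table <- data.frame(
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  sigma  = sapply(models, function(m) summary(m)$sigma)
)
fit_table
```

The adjusted coefficient of determination is used for the comparison because the plain \(R^2\) can never decrease when terms are added to a model, whereas the adjusted version penalizes the additional parameters.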

We will compare two of these models to see how well they fit the data.

First, we create an interpolation grid. Summarize the information to ascertain the domain of the data:

summary(Walker_Lake[,2:3])
##        X               Y        
##  Min.   :  8.0   Min.   :  8.0  
##  1st Qu.: 51.0   1st Qu.: 80.0  
##  Median : 89.0   Median :139.5  
##  Mean   :111.1   Mean   :141.3  
##  3rd Qu.:170.0   3rd Qu.:208.0  
##  Max.   :251.0   Max.   :291.0

We can see that the spatial domain is in the range of [8,251] in X, and [8,291] in Y. Based on this information, we will generate the following sequences, which are then expanded into a grid for prediction:

X.p <- seq(from = 0.0, to = 255.0, by = 2.5)
Y.p <- seq(from = 0.0, to = 295.0, by = 2.5)
df.p <- expand.grid(X = X.p, Y = Y.p)
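A quick check of the size of this grid is useful, because the lengths of the two sequences are the numbers of columns and rows needed when the predictions are arranged as matrices for plotting. The sketch repeats the two sequences so that it can be run on its own; the name `grid_check` is an assumption of the example.

```r
# Same sequences as in the text
X.p <- seq(from = 0.0, to = 255.0, by = 2.5)
Y.p <- seq(from = 0.0, to = 295.0, by = 2.5)
grid_check <- expand.grid(X = X.p, Y = Y.p)

# Number of values in each direction and total number of prediction points
grid_dims <- c(n_X = length(X.p), n_Y = length(Y.p), n_grid = nrow(grid_check))
grid_dims
```

There are 103 values of X and 119 values of Y, for a total of 12,257 prediction locations. Using `length(Y.p)` and `length(X.p)` instead of typing the numbers by hand avoids errors if the resolution of the grid is changed.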

To this dataframe we add the polynomial terms:

df.p <- mutate(df.p, X3 = X^3, X2Y = X^2 * Y, X2 = X^2, 
               XY = X * Y, 
               Y2 = Y^2, XY2 = X * Y^2, Y3 = Y^3)

The interpolated quadratic surface is then obtained as:

WL.preds2 <- predict(WL.trend2, newdata = df.p, se.fit = TRUE, interval = "prediction", level = 0.95)

Whereas the interpolated cubic surface is obtained as:

WL.preds3 <- predict(WL.trend3, newdata = df.p, se.fit = TRUE, interval = "prediction", level = 0.95)

The predictions are transformed into matrices for plotting.

Quadratic trend surface and lower and upper bounds of the predictions:

z.p2 <- matrix(data = WL.preds2$fit[,1], nrow = 119, ncol = 103, byrow = TRUE)
z.p2_l <- matrix(data = WL.preds2$fit[,2], nrow = 119, ncol = 103, byrow = TRUE)
z.p2_u <- matrix(data = WL.preds2$fit[,3], nrow = 119, ncol = 103, byrow = TRUE)

Cubic trend surface and lower and upper bounds of the predictions:

z.p3 <- matrix(data = WL.preds3$fit[,1], nrow = 119, ncol = 103, byrow = TRUE)
z.p3_l <- matrix(data = WL.preds3$fit[,2], nrow = 119, ncol = 103, byrow = TRUE)
z.p3_u <- matrix(data = WL.preds3$fit[,3], nrow = 119, ncol = 103, byrow = TRUE)

This is the quadratic trend surface with its confidence interval of predictions:

WL.plot2 <- plot_ly(x = ~X.p, y = ~Y.p, z = ~z.p2, 
        type = "surface", colors = "YlOrRd") %>% 
  add_surface(x = ~X.p, y = ~Y.p, z = ~z.p2_l, 
              opacity = 0.5, showscale = FALSE) %>%
  add_surface(x = ~X.p, y = ~Y.p, z = ~z.p2_u, 
              opacity = 0.5, showscale = FALSE) %>% 
  layout(scene = list(
    aspectmode = "manual", aspectratio = list(x = 1, y = 1, z = 1)))
WL.plot2

And, this is the cubic trend surface with its confidence interval of predictions:

WL.plot3 <- plot_ly(x = ~X.p, y = ~Y.p, z = ~z.p3, 
        type = "surface", colors = "YlOrRd") %>% 
  add_surface(x = ~X.p, y = ~Y.p, z = ~z.p3_l, 
              opacity = 0.5, showscale = FALSE) %>%
  add_surface(x = ~X.p, y = ~Y.p, z = ~z.p3_u, 
              opacity = 0.5, showscale = FALSE) %>% 
  layout(scene = list(
    aspectmode = "manual", aspectratio = list(x = 1, y = 1, z = 1)))
WL.plot3

Alas, these models are not very reliable estimates of the underlying field. As can be seen from the plots, the confidence intervals are extremely wide, and in both cases include negative numbers in the lower bound. The uncertainty associated with these predictions is quite substantial.

Another question, however, is whether the point estimates are accurate. To get a sense of whether this is the case we can add the observations to the plot:

WL.plot3 %>%
  add_markers(data = Walker_Lake, x = ~X, y = ~Y, z = ~V, 
              color = ~V, opacity = 0.7, showlegend = FALSE)

The trend surface does a mediocre job with the point estimates as well: many observations lie far from the fitted surface.

A possible reason for this is that the model failed to capture all or even most of the systematic spatial variability of this field. To explore this, we will plot the residuals of the model, after labeling them as “positive” or “negative”:

Walker_Lake$residual3 <- ifelse(WL.trend3$residuals > 0, "Positive", "Negative")

Plot the residuals:

ggplot(data = Walker_Lake, 
       aes(x = X, y = Y, color = residual3)) +
  geom_point() +
  coord_equal()

Visual inspection of the distribution of the residuals strongly suggests that they are not random. We can check this by means of Moran’s \(I\) coefficient, if we create a list of spatial weights as follows:

# Create a set of spatial weights with the 5 nearest neighbors.
WL.listw <- Walker_Lake[,2:3] %>% 
  as.matrix() %>%
  knearneigh(k = 5) %>%
  knn2nb() %>%
  nb2listw()

The results of the autocorrelation analysis of the residuals are:

moran.test(x = WL.trend3$residuals, listw = WL.listw)
## 
##  Moran I test under randomisation
## 
## data:  WL.trend3$residuals  
## weights: WL.listw    
## 
## Moran I statistic standard deviate = 17.199, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.4633803457     -0.0021321962      0.0007325452

Given the very low \(p\)-value, we reject the null hypothesis of spatial independence and conclude, with a high level of confidence, that the residuals are not independent. This has important implications for spatial interpolation, as we will discuss in the following chapter.
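To recall what the statistic measures, Moran's \(I\) can be computed by hand with row-standardized weights for the \(k\) nearest neighbors. The sketch below is written in base R with simulated locations and simulated "residuals" that have a spatial pattern by construction; it is an illustration of the formula under these assumptions, not a replacement for `moran.test()`, and it does not compute the variance or the \(p\)-value.

```r
set.seed(99)

# Simulated locations and residuals with a smooth spatial pattern (hypothetical)
n_pts <- 100
xy <- cbind(x = runif(n_pts), y = runif(n_pts))
res <- sin(2 * pi * xy[, "x"]) + rnorm(n_pts, sd = 0.3)

# Row-standardized spatial weights for the k nearest neighbors
k <- 5
d <- as.matrix(dist(xy))
W <- matrix(0, nrow = n_pts, ncol = n_pts)
for (i in seq_len(n_pts)) {
  nn <- order(d[i, ])[2:(k + 1)]  # skip the first entry: the point itself
  W[i, nn] <- 1 / k
}

# Moran's I: spatially weighted cross-products over the sum of squares
z <- res - mean(res)
moran_I <- (n_pts / sum(W)) * sum(W * outer(z, z)) / sum(z^2)

# Expected value under the null hypothesis of spatial independence
expected_I <- -1 / (n_pts - 1)
c(I = moran_I, expectation = expected_I)
```

The expectation under the null hypothesis depends only on the number of observations. For the 470 observations of Walker Lake it is \(-1/469 \approx -0.00213\), which is the value reported as Expectation in the output above.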

33.6 Accuracy and precision

Before concluding this chapter, it is worthwhile to make the following distinction between accuracy and precision of the estimates.

Accuracy refers to how close the predicted values \(\hat{z}_p\) are to the true values of the field. Precision refers to how much uncertainty is associated with such predictions. Narrow intervals of confidence imply greater precision, whereas the opposite is true when the intervals of confidence are wide.

An example of these two properties is shown in the following figure.

Figure: Accuracy and precision

Panel a) in the figure represents a set of accurate points, since they are on average close to the mark. However, they are imprecise, given their variability. This is akin to a good point estimate that has wide confidence intervals.

Panel b) is a set of inaccurate and imprecise points.

Panel c) is a set of precise but inaccurate points.

Finally, Panel d) is a set of accurate and precise points.
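The four situations can be mimicked with a small simulation. In the sketch below, each panel is a cloud of points around a mark located at the origin; the bias and spread assigned to each panel, and the helper `simulate_shots()`, are assumptions chosen to reproduce the description of the panels, not the values used to draw the figure.

```r
set.seed(42)

# Hypothetical helper: n points scattered around a mark at (0, 0),
# displaced by `bias` in both directions and with standard deviation `spread`
simulate_shots <- function(bias, spread, n = 200) {
  data.frame(x = rnorm(n, mean = bias, sd = spread),
             y = rnorm(n, mean = bias, sd = spread))
}

panels <- list(a = simulate_shots(bias = 0, spread = 2),    # accurate, imprecise
               b = simulate_shots(bias = 3, spread = 2),    # inaccurate, imprecise
               c = simulate_shots(bias = 3, spread = 0.3),  # inaccurate, precise
               d = simulate_shots(bias = 0, spread = 0.3))  # accurate, precise

# Accuracy: distance from the average point to the mark (smaller is better)
# Precision: spread of the points around their own average (smaller is better)
panel_summary <- t(sapply(panels, function(p) {
  c(offset = sqrt(mean(p$x)^2 + mean(p$y)^2),
    spread = sqrt(var(p$x) + var(p$y)))
}))
panel_summary
```

The offset plays the role of the error of a point estimate, whereas the spread plays the role of the width of its interval of confidence.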

Accuracy and precision are important criteria when assessing the quality of a predictive model.

This concludes the chapter.

34 Activity 16: Spatially Continuous Data II

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

34.1 Practice questions

Answer the following questions:

  1. What is a confidence interval?
  2. How does a confidence interval vary with the level of significance?
  3. Residuals of trend surface analysis are always spatially independent, true or false.
  4. Estimates of the prediction error \(\hat{\epsilon}_p\) can be obtained from trend surface analysis, true or false. Explain.
  5. In your own words describe the concepts of accuracy and precision in spatial interpolation.

34.2 Learning objectives

In this activity, you will:

  1. Use trend surface analysis to interpolate a field.
  2. Calculate the degree of uncertainty.
  3. Think about the role of residual autocorrelation in interpolation.

34.3 Suggested reading

  • Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapters 5 and 6. Longman: Essex.
  • Bivand RS, Pebesma E, and Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 8. Springer: New York.
  • Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 6, Sections 6.7 and 6.8. Sage: Los Angeles.
  • Isaaks EH and Srivastava RM (1989) An Introduction to Applied Geostatistics, Chapter 4. Oxford University Press: Oxford.
  • O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapters 9 and 10. John Wiley & Sons: New Jersey.

34.4 Preliminaries

It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently in the workspace.

Load the libraries you will use in this activity (load other packages as appropriate).

library(isdas)
library(tidyverse)
library(spatstat)
library(spdep)

Load the data that you will use in this activity:

data("aquifer")

The data is a set of piezometric head (watertable pressure) observations of the Wolfcamp Aquifer in Texas (https://en.wikipedia.org/wiki/Hydraulic_head). Measures of pressure can be used to infer the flow of underground water, since water flows from high to low pressure areas.

These data were collected to evaluate potential flow of contamination related to a high level toxic waste repository in Texas. The Deaf Smith county site in Texas was one of three potential sites proposed for this repository. Beneath the site is a deep brine aquifer known as the Wolfcamp aquifer that may serve as a conduit of contamination leaking from the repository.

The data set consists of 85 georeferenced measurements of piezometric head. Possible applications of interpolation are to determine sites at risk and to quantify uncertainty of the interpolated surface, to evaluate the best locations for monitoring stations.

34.5 Activity

NOTE: Activities include technical “how to” tasks/questions. Usually, these ask you to practice using the software to organize data, create plots, and so on in support of analysis and interpretation. The second type of questions ask you to activate your brainware and to think geographically and statistically.

Activity Part I

  1. Estimate a trend surface for the dataset experimenting with different polynomials.

  2. Create an interpolation grid, and use the function predict to interpolate the field using your chosen model. Plot the interpolated field using a method of your choice (e.g., ggplot2, plot_ly() for 3D plotting, etc.)

Activity Part II

  1. Which polynomial in your experiments provides the best fit? (Hint: consider the coefficient of multiple determination \(R^2\) and the standard error, in addition to the significance of the parameters.) Justify your choice of a polynomial.

  2. Inspect the confidence intervals of your chosen model (these are an output of predict).

  3. Inspect the residuals of the model. Are they spatially random? If not, what would be the implications for spatial interpolation?

35 Spatially Continuous Data III

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

In the previous practice you were introduced to the concept of fields/spatially continuous data.

35.1 Learning objectives

Previously, in Chapter @ref(spatially-continuous-data-ii), we discussed some limitations of tile-based approaches, inverse distance weighting, and \(k\)-point means. In particular, these methods do not provide estimates of the uncertainty of point estimates when doing spatial interpolation. Trend surface analysis was introduced as a method for spatial interpolation that also provides estimates of the standard error. However, we saw that it is possible for the residuals of a trend surface model to be autocorrelated: this is an indication that there is still systematic variation in the residuals that was not fully captured by the model. To more fully exploit that residual pattern we need some additional tools. In this practice, you will learn about some of these tools, as follows:

  1. About the implications of residual spatial pattern for predictions.
  2. The measurement of spatial dependence in fields.
  3. Variographic analysis.

35.2 Suggested reading

  • Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapters 5 and 6. Longman: Essex.
  • Bivand RS, Pebesma E, and Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 8. Springer: New York.
  • Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 6, Sections 6.7 and 6.8. Sage: Los Angeles.
  • Isaaks EH and Srivastava RM (1989) An Introduction to Applied Geostatistics, Chapter 7. Oxford University Press: Oxford.
  • O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapters 9 and 10. John Wiley & Sons: New Jersey.

35.3 Preliminaries

As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently in the workspace.

Load the libraries you will use in this activity:

library(isdas)
library(gstat)
library(spdep)
library(tidyverse)

Begin by loading the data file:

# We have been working with the Walker Lake dataset for the last few chapters.
data("Walker_Lake")  

You can verify the contents of the dataframe:

summary(Walker_Lake)
##       ID                  X               Y               V                U           T      
##  Length:470         Min.   :  8.0   Min.   :  8.0   Min.   :   0.0   Min.   :   0.00   1: 45  
##  Class :character   1st Qu.: 51.0   1st Qu.: 80.0   1st Qu.: 182.0   1st Qu.:  83.95   2:425  
##  Mode  :character   Median : 89.0   Median :139.5   Median : 425.2   Median : 335.00          
##                     Mean   :111.1   Mean   :141.3   Mean   : 435.4   Mean   : 613.27          
##                     3rd Qu.:170.0   3rd Qu.:208.0   3rd Qu.: 644.4   3rd Qu.: 883.20          
##                     Max.   :251.0   Max.   :291.0   Max.   :1528.1   Max.   :5190.10          
##                                                                      NA's   :195

35.4 Residual spatial pattern

In Chapter @ref(spatially-continuous-data-ii) we used trend surface analysis for spatial interpolation. Trend surface analysis improves on methods such as Voronoi polygons, IDW, and \(k\)-point means by providing a built-in mechanism for estimating the uncertainty in the predictions. Let us quickly revisit this idea.

The objective of interpolation is to provide the following estimates: \[ \hat{z}_p + \hat{\epsilon}_p \]

Trend surface analysis provides interpolated values by generating a trend surface as follows: \[ \hat{z} = f(x, y) \] from which estimates of \(\hat{z}_p\) can be obtained by using suitable prediction coordinates \((x_p, y_p)\).

Next, although trend surface analysis does not provide an estimate of the prediction error \(\hat{\epsilon}_p\) (since we do not know the true value of the field at \(p\)), it provides confidence intervals for the prediction. In this way we can at the very least bound the prediction error as follows: \[ CI_{z_p} = [z_{p_{lwr}}, z_{p_{upr}}]. \]

As previously seen, however, use of trend surface analysis does not guarantee that the residuals of the model will be independent.

Let us revisit the model for Walker Lake.

As before, we first calculate the polynomial terms of the coordinates:

# Here we use `mutate()` to calculate the polynomial terms of the coordinates.
Walker_Lake <- mutate(Walker_Lake,
                        X3 = X^3, X2Y = X^2 * Y, X2 = X^2, 
                        XY = X * Y,
                        Y2 = Y^2, XY2 = X * Y^2, Y3 = Y^3)

And proceed to estimate the following cubic trend surface model, which provided the best fit to the data:

# Re-estimate the cubic trend surface for Walker Lake with `lm()` (linear model)
WL.trend3 <- lm(formula = V ~ X3 + X2Y + X2 + X + XY + Y + Y2 + XY2 + Y3, 
                data = Walker_Lake) 
summary(WL.trend3)
## 
## Call:
## lm(formula = V ~ X3 + X2Y + X2 + X + XY + Y + Y2 + XY2 + Y3, 
##     data = Walker_Lake)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -564.19 -197.41    7.91  194.25  929.72 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.620e+00  1.227e+02  -0.070 0.944035    
## X3           1.533e-04  4.806e-05   3.190 0.001522 ** 
## X2Y          6.139e-05  3.909e-05   1.570 0.117000    
## X2          -6.651e-02  1.838e-02  -3.618 0.000330 ***
## X            9.172e+00  2.386e+00   3.844 0.000138 ***
## XY          -4.420e-02  1.430e-02  -3.092 0.002110 ** 
## Y            4.794e+00  2.040e+00   2.350 0.019220 *  
## Y2          -1.806e-03  1.327e-02  -0.136 0.891822    
## XY2          7.679e-05  2.956e-05   2.598 0.009669 ** 
## Y3          -4.170e-05  2.819e-05  -1.479 0.139759    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.7 on 460 degrees of freedom
## Multiple R-squared:  0.1719, Adjusted R-squared:  0.1557 
## F-statistic: 10.61 on 9 and 460 DF,  p-value: 5.381e-15

To examine the residuals, first we label them as “positive” or “negative”:

# The function `ifelse()` is used to label the residuals as "Positive" if they are 
# greater than zero, or "Negative" if they are zero or less.
Walker_Lake <- Walker_Lake %>%
  mutate(residual3 = ifelse(WL.trend3$residuals > 0, 
                            "Positive", 
                            "Negative"))

Once the residuals have been labeled, they can be plotted as follows:

ggplot(data = Walker_Lake, 
       # Note color is only applied to results of positive or negative residuals
       aes(x = X, y = Y, color = residual3)) + 
  geom_point() +
  coord_equal() # Ensures equal scales for both axes

As seen before, there is considerable spatial autocorrelation as confirmed by Moran’s \(I\) coefficient:

# Take the coordinates of Walker Lake and convert to matrix.
WL.listw <- as.matrix(Walker_Lake[,2:3]) %>% 
  # Find the 5 nearest neighbors of each observation.
  knearneigh(k = 5) %>% 
  # Convert the nearest neighbors to `nb` object.
  knn2nb() %>% 
  # Convert the `nb` object into spatial weights. 
  nb2listw() 

# Use Moran's test on the residuals of the trend surface model 
moran.test(x = WL.trend3$residuals, listw = WL.listw)
## 
##  Moran I test under randomisation
## 
## data:  WL.trend3$residuals  
## weights: WL.listw    
## 
## Moran I statistic standard deviate = 17.199, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.4633803457     -0.0021321962      0.0007325452

The fact that the residuals are not independent has important implications for prediction. Consider the following thought experiment.

Imagine that you were asked to guess whether the residual was positive or negative at the locations indicated with triangles in the figure. These are locations where an observation was not made, and we only have the interpolated value of the variable according to the trend surface model:

ggplot(data = Walker_Lake, 
       aes(x = X, y = Y)) +
  geom_point(aes(color = residual3)) +
  # Here we add coordinates for the triangles in the figure
  geom_point(data = data.frame(x = c(55, 25, 210, 227), y = c(200, 90, 90, 230)), 
             aes(x = x, y = y), shape = 17, size = 3) +
  coord_equal()

What would your guess be, and why? Would you say that your guess has a better than 50% chance of being right?

Now imagine that you were asked to guess whether the residual was positive or negative at the locations indicated with squares in the figure:

ggplot(data = Walker_Lake, 
       aes(x = X, y = Y)) +
  geom_point(aes(color = residual3)) +
  # Here we add coordinates for the squares in the figure
  geom_point(data = data.frame(x = c(160, 240, 12, 120), y = c(38, 280, 240, 180)), 
             aes(x = x, y = y), shape = 15, size = 3) +
  coord_equal()

Again, what would your guess be, and why? Would you be able to guess this way if the residuals were random?

The fact that you can guess and be fairly sure about the quality of your guess is a consequence of the strong residual pattern. If the residuals were random, there would be no information left to use: the odds of a residual being positive or negative would essentially be 50%. However, when there is residual pattern, this information can be used to enhance the quality of your guesses about the residuals, or in other words, of the \(\hat{\epsilon}_p\) terms. At the very least you can guess whether they are positive or negative (therefore reducing their confidence intervals), but possibly you can learn even more from them, as will be seen later.

Before learning how to do this, however, we need to think more about the way in which we measure spatial pattern in spatially continuous data.

35.5 Measuring spatial dependence in spatially continuous data

In the preceding sections we used Moran’s \(I\) coefficient to measure spatial pattern. Moran’s \(I\) is, by design, a single-scale statistic, not unlike the case of nearest neighbor analysis in point patterns. The reason for this is that Moran’s \(I\) is limited to detecting pattern at the scale at which the spatial weights are defined: for instance, at the level of adjacency, contiguity, or \(k\)-nearest neighbors.

While this makes sense (mostly) in the case of area data, since the areas inherently introduce spatial discontinuities, it makes less sense in the case of fields, where the underlying process is typically smooth. In fact, more often we are interested in exploring the properties of the pattern over the field, not just the nearest neighbors.

One way of extending Moran’s \(I\) analysis to multiple scales is by means of the correlogram. The correlogram is simply a sequence of Moran’s \(I\) coefficients computed at different scales.

Consider for example the following sequence of coefficients, computed for \(k\)=10 neighbors to \(k\)=30 neighbors. Notice how the for loop calculates spatial weights using the designated number of neighbors, before calculating Moran’s \(I\).

# Initialize the values of k
k <- c(10:30)

# Initialize an empty vector to store the results of calculating Moran's I 
moranI <- numeric(length = length(k)) 

# Initialize an empty dataframe to store the values of k and moranI
correlogram <- data.frame(k, moranI) 

# A `for` loop is a way of repeating instructions a defined number of times, 
# here from 1 to the length of vector `k`.
for(i in 1:length(k)){
  listwk <- Walker_Lake[,2:3] %>%
    as.matrix() %>%
     # Use the ith element of vector `k` to find the nearest neighbors
    knearneigh(k = k[i]) %>%
    knn2nb() %>%
    nb2listw()
  
  # Moran test for residuals
  m <- moran.test(x = WL.trend3$residuals, listw = listwk)
  
  # Assign the value of Moran's I statistic to the ith element of column `moranI` in `correlogram`
  correlogram$moranI[i] <- m$estimate[1] 
}

Given the values of Moran’s \(I\) at different scales (i.e., values of \(k\)), the correlogram can be plotted as:

ggplot(data = correlogram,
       aes(x = k, 
           y = moranI)) + 
  geom_point()

As can be seen in the plot, spatial autocorrelation tends to decline as the number of nearest neighbors used in the test grows - in other words, as the scale of the test increases. This is a common occurrence: when autocorrelation is present, observations tend to be more similar to their closest neighbors than to their more distant neighbors.

The use of \(k\)-nearest neighbors points to a problem, however. The scale of the analysis is defined by a number of neighbors and not by distance, which would be a more natural metric for a continuous process. In this case, \(k\)-nearest neighbors were used to ensure that each sum in the coefficient had the same number of observations. However, this means that “neighborhoods” will be geographically smaller where the observations are more dense, and larger where they are sparse.

While this issue is not insurmountable (for instance, instead of \(k\)-nearest neighbors we could have used the neighbors found at a certain distance), it points to the fact that Moran’s \(I\) is not by design well suited for the analysis of spatially continuous data.
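To illustrate the distance-based alternative, the following is a minimal base R sketch that computes Moran’s \(I\) using binary weights defined by distance bands. The function `moran_band()` and the simulated field are hypothetical and introduced only for this example. Note that the weights here are binary, whereas `nb2listw()` row-standardizes the weights by default, so the values are not directly comparable to those reported by `moran.test()`.

```r
# Moran's I with binary weights: w_ij = 1 if d_min < d_ij <= d_max, 0 otherwise
# (hypothetical helper, not part of `spdep`)
moran_band <- function(coords, z, d_min, d_max) {
  d <- as.matrix(dist(coords))
  w <- (d > d_min & d <= d_max) * 1
  zc <- z - mean(z)
  n <- length(z)
  (n / sum(w)) * sum(w * outer(zc, zc)) / sum(zc^2)
}

# Simulate a smooth field with some noise at irregularly spaced locations
set.seed(10)
pts <- cbind(x = runif(200, 0, 100), 
             y = runif(200, 0, 100))
z <- sin(pts[, "x"] / 15) + cos(pts[, "y"] / 15) + rnorm(200, sd = 0.2)

# Limits of the distance bands: 0-10, 10-20, ..., 40-50
breaks <- seq(from = 0, to = 50, by = 10)

# Correlogram by distance band
correlogram_d <- data.frame(
  d_max = breaks[-1],
  moranI = sapply(seq_along(breaks[-1]), 
                  function(i) moran_band(pts, z, breaks[i], breaks[i + 1]))
)
correlogram_d
```

In this sketch every band covers the same range of distances everywhere on the map, at the cost of an uneven number of pairs per band.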

A different approach, known as variographic analysis, is introduced next.

35.6 Variographic analysis

To introduce variographic analysis it is worthwhile to recall the definition of the covariance between two variables, say \(X\) and \(Y\): \[ C(X,Y) = E[{(X_i - \bar{X})(Y_i - \bar{Y})}] \] where \(\bar{X}\) and \(\bar{Y}\) are the means of \(X\) and \(Y\) respectively.

The expectation operator \(E[]\) is estimated in a sample by the mean: \[ C(X,Y) = \frac{1}{n}\sum_{i=1}^{n}{(X_i - \bar{X})(Y_i - \bar{Y})} \]
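As a quick numerical check, the covariance can be computed by hand as the mean of the products of the deviations from the means. The vectors `x` and `y` below are simulated for illustration only. Note that the function `cov()` in R divides by \(n - 1\) instead of \(n\), so it must be rescaled for the comparison.

```r
# Two correlated variables (simulated, for illustration only)
set.seed(42)
x <- rnorm(100)
y <- 0.6 * x + rnorm(100, sd = 0.8)
n <- length(x)

# Covariance as the mean of the products of deviations from the means
c_xy <- mean((x - mean(x)) * (y - mean(y)))

# Compare to `cov()`, rescaled from a divisor of n - 1 to a divisor of n
c(by_hand = c_xy, 
  rescaled_cov = cov(x, y) * (n - 1) / n)
```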

The observations \(X_i\) and \(Y_i\) in the covariance formula can be seen as points in a scatterplot, with the axes shifted to the means of \(X\) and \(Y\), as shown in Figure @ref{fig:covariance-as-scatterplot}.

\label{fig:covariance-as-scatterplot}Observations of the covariance as a scatterplot

The autocovariance of variable \(z\) can be defined in a similar way, the difference being that instead of two variables, it is the covariance of a variable with itself but at a different location (i.e., between locations \(i\) and \(j\)): \[ C(z_i,z_j) = E[{(z_i - \bar{z})(z_j - \bar{z})}] \]

To implement the spatial autocovariance we need some criterion to explicitly define the spatial relationship between locations \(i\) and \(j\). A useful criterion in this case is as follows:

\[ w_{ij}(h)=\bigg\{\begin{array}{l l} 1\text{ if } d_{ij} = h\\ 0\text{ otherwise}\\ \end{array} \] In other words, \(i\) and \(j\) are considered to be spatially related for the purposes of calculating the autocovariance, if the distance between the two locations is equal to some predefined spatial lag \(h\).

The above criterion makes explicit the assumption that the autocovariance is a function of the separation \(h\) between two observations, but not of other factors, such as the angle between observations. This assumption is called isotropy.

Further, if we assume that the variance of \(z\) is constant, and the correlation between observations does not depend on location (an assumption called intrinsic stationarity), we can pool observations from across the map to create a scatterplot to form the basis of the autocovariance calculations.

Consider the (regular) arrangement of observations spaced at \(h\) in Figure @ref{fig:autocovariance}. Each observation generally has four neighbors at spatial lag \(h\), with the exception of those on the edges, which have fewer. This means that most observations will contribute four points to the scatterplot (\(z_i\) paired with each of \(z_j\), \(z_k\), \(z_l\), and \(z_m\)), while others will contribute three (those on the edges) or only two (those in the corners).

\label{fig:autocovariance}Finding spatial pairs for the calculation of the autocovariance

Given those pairs of observations, the autocovariance at lag \(h\) can be calculated as: \[ C_{z}(h) = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n}{w_{ij}(h)(z_i - \bar{z})(z_j - \bar{z})}}{\sum_{i=1}^{n}\sum_{j=1}^{n}{w_{ij}(h)}} \]

Changing the spatial lag \(h\) allows us to calculate the autocovariance at different scales. The plot of the autocovariance at different scales is called a covariogram.

A related quantity that is more commonly used (mainly for historical reasons) is the semivariance.

The semivariance is defined as follows, calculated based on the difference between \(z_i\) and \(z_j\): \[ \hat{\gamma}_{z}(h) = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n}{w_{ij}(h)(z_i - z_j)^2}}{2\sum_{i=1}^{n}\sum_{j=1}^{n}{w_{ij}(h)}} \]

The plot of the semivariance at different lags \(h\) is called a semivariogram.

The covariogram and semivariogram are related by the following formula: \[ C_{z}(h) =\sigma^2 - \hat{\gamma}_{z}(h) \] where \(\sigma^2\) is the sample variance.
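This relationship can be verified with a short derivation, sketched here under the assumption that the mean and the variance of \(z\) are constant across locations. The symbol \(\mu\) for the mean of the field is introduced only for this illustration, and \(\gamma_z(h)\) and \(C_z(h)\) denote the theoretical quantities at lag \(h\):

\[ \begin{aligned} \gamma_{z}(h) &= \frac{1}{2}E[(z_i - z_j)^2] \\ &= \frac{1}{2}E[((z_i - \mu) - (z_j - \mu))^2] \\ &= \frac{1}{2}\left(E[(z_i - \mu)^2] + E[(z_j - \mu)^2]\right) - E[(z_i - \mu)(z_j - \mu)] \\ &= \sigma^2 - C_{z}(h) \end{aligned} \]

Rearranging the last line gives \(C_{z}(h) = \sigma^2 - \gamma_{z}(h)\). With sample quantities the equality holds only approximately.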

The condition that \(d_{ij} = h\) is, with the exception of gridded data, too strict, and is often relaxed in the following way:

\[ w_{ij}(\tilde{h})=\bigg\{\begin{array}{l l} 1\text{ if } d_{ij}\simeq h\\ 0\text{ otherwise}\\ \end{array} \]

In this way, the distance between observations \(i\) and \(j\) does not need to be exactly \(h\), but only approximately so. The approximation can be defined explicitly as follows:

\[ w_{ij}(\tilde{h})=\bigg\{\begin{array}{l l} 1\text{ if } h - \Delta h < d_{ij} < h + \Delta h\\ 0\text{ otherwise}\\ \end{array} \]

Instead of forming pairs with observations that are at exactly a distance \(h\) (which would lead in many cases to too few pairs), pairs are formed with observations at approximately lag \(h\) (or \(\tilde{h}\)), with a tolerance given by \(\Delta h\).
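The calculation of the semivariance with a tolerance can be sketched by hand in base R. The function `semivariance_lag()` and the three observations on a transect are hypothetical and used only for illustration; this is not how gstat implements the calculation internally.

```r
# Semivariance at lag h with tolerance dh (hypothetical helper)
semivariance_lag <- function(coords, z, h, dh) {
  d <- as.matrix(dist(coords))
  # w_ij = 1 if h - dh < d_ij < h + dh; the upper triangle is used so that 
  # each pair of observations is counted only once
  w <- (d > h - dh) & (d < h + dh) & upper.tri(d)
  # Squared differences (z_i - z_j)^2 for all pairs
  sq_diff <- outer(z, z, "-")^2
  c(np = sum(w), 
    gamma = sum(sq_diff[w]) / (2 * sum(w)))
}

# Three observations on a transect, spaced one unit apart
transect <- cbind(x = c(0, 1, 2), 
                  y = c(0, 0, 0))
z_obs <- c(1, 2, 4)

# Two pairs at lag 1: ((1 - 2)^2 + (2 - 4)^2) / (2 * 2) = 1.25
lag1 <- semivariance_lag(transect, z_obs, h = 1, dh = 0.5)
# One pair at lag 2: (1 - 4)^2 / (2 * 1) = 4.5
lag2 <- semivariance_lag(transect, z_obs, h = 2, dh = 0.5)
rbind(lag1, lag2)
```

Counting each pair once or twice does not change the result, since the count appears in both the numerator and the denominator.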

Analysis based on the semivariogram (called variographic analysis) is implemented in R in the gstat package.

We will illustrate the use of the semivariogram by means of the Walker Lake data. The package gstat accepts simple features objects of the sf package, so we convert our dataframe into such an object:

Walker_Lake.sf <- st_as_sf(Walker_Lake, coords = c("X", "Y"))
class(Walker_Lake.sf)
## [1] "sf"         "tbl_df"     "tbl"        "data.frame"

The empirical semivariogram is calculated by means of the gstat::variogram function, as follows:

# `variogram()` calculates the sample semivariogram from data, 
# or if a linear model is given, for the residuals; in this case, 
# the formula `V ~ 1` means that we are not using a model
variogram_z <- variogram(V ~ 1, data = Walker_Lake.sf) 

# Note we are plotting the data of `variogram_z`
ggplot(data = variogram_z,
       aes(x = dist, 
           y = gamma)) + 
  geom_point() + 
  # Add labels to indicate the number of pairs of observations used 
  # in the calculation of each point in the variogram
  geom_text(aes(label = np), 
            nudge_y = -1500) + 
  # Add labels to axes
  xlab("Distance") +
  ylab("Semivariance")

The numbers indicate the number of pairs of observations used to calculate the semivariance at the corresponding lag.

Since the sample variance is:

# We are calculating the variance of V
s2 <- var(Walker_Lake$V) 
s2
## [1] 90694.59

It follows that the covariogram in this case is:

ggplot(data = variogram_z, 
       aes(x = dist, 
           y = s2 - gamma)) +
  geom_point() + 
  geom_text(aes(label = np), 
            nudge_y = -1500) +
  xlab("Distance") + 
  ylab("Autocovariance")

As expected, the autocovariance (and hence, the autocorrelation) is stronger at short spatial lags, and declines at larger spatial lags.

The above plots are the empirical semivariogram and covariogram. These plots are used to model a theoretical semivariogram, a function that can be used to estimate spatial dependence at any lag within the domain, and not just at the distances for which we have points in the empirical variogram.

Since the semivariogram is the expectation of a squared difference, the function selected for modeling the theoretical semivariogram must be non-negative. Several functions satisfy this condition, a list of which is available in gstat as shown below:

# This function generates a variogram model. Called without arguments, 
# it returns the list of possible models for a semivariogram
vgm() 
##    short                                      long
## 1    Nug                              Nug (nugget)
## 2    Exp                         Exp (exponential)
## 3    Sph                           Sph (spherical)
## 4    Gau                            Gau (gaussian)
## 5    Exc        Exclass (Exponential class/stable)
## 6    Mat                              Mat (Matern)
## 7    Ste Mat (Matern, M. Stein's parameterization)
## 8    Cir                            Cir (circular)
## 9    Lin                              Lin (linear)
## 10   Bes                              Bes (bessel)
## 11   Pen                      Pen (pentaspherical)
## 12   Per                            Per (periodic)
## 13   Wav                                Wav (wave)
## 14   Hol                                Hol (hole)
## 15   Log                         Log (logarithmic)
## 16   Pow                               Pow (power)
## 17   Spl                              Spl (spline)
## 18   Leg                            Leg (Legendre)
## 19   Err                   Err (Measurement error)
## 20   Int                           Int (Intercept)

The anatomy of a semivariogram includes a range, a sill, and possibly a nugget. These elements are shown in Figure @ref{fig:semivariogram}.

\label{fig:semivarigoram}Anatomy of a semivariogram

Since the semivariogram is calculated based on the square of the differences \(z_i - z_j\), the smaller the semivariance is, the more similar observations tend to be. In principle, the semivariogram begins at zero, because at distance zero an observation is identical to itself (i.e., \(z_i - z_i = 0\)). The sill is the value at which the semivariance levels off and becomes simply the variance, meaning that there is no more or less similarity between observations than would be implied by the variance of the sample. The range is the distance at which the sill is reached.

An additional element is the nugget. While the semivariogram in principle begins at zero, sometimes a discontinuity near the origin can be observed. The terminology is from mining, and reflects the fact that a nugget could be very different from the material around it, hence the jump in the semivariogram.

Some theoretical functions are shown next.

Exponential semivariogram:

# We use "Exp" to select an exponential semivariogram, here with a partial 
# sill of 1 and a range of 1. Refer to the list of models returned by `vgm()` 
# above and explore the different outcomes of the listed variogram models! 
plot(variogramLine(vgm(1, 
                       "Exp",
                       1), 
                   10),
     type = 'l') 

Spherical semivariogram:

plot(variogramLine(vgm(1, 
                       "Sph",
                       1), 
                   10), 
     type = 'l')

Gaussian semivariogram:

plot(variogramLine(vgm(1, 
                       "Gau", 
                       1), 
                   10), 
     type = 'l')

These plots illustrate some differences in the behavior of the models. For identical parameters, the Gaussian model provides smoother changes near the origin. The spherical model reaches the sill more rapidly than the other models.
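These differences can be verified numerically with the functional forms commonly used for the three models. The functions `gamma_exp()`, `gamma_sph()`, and `gamma_gau()` below are hypothetical helpers written for this illustration, under the assumption that they match the parameterization used by gstat.

```r
# Exponential model: approaches the sill asymptotically
gamma_exp <- function(h, psill = 1, range = 1) {
  psill * (1 - exp(-h / range))
}

# Spherical model: reaches the sill exactly at the range
gamma_sph <- function(h, psill = 1, range = 1) {
  ifelse(h < range, 
         psill * (1.5 * h / range - 0.5 * (h / range)^3), 
         psill)
}

# Gaussian model: flat near the origin, approaches the sill asymptotically
gamma_gau <- function(h, psill = 1, range = 1) {
  psill * (1 - exp(-(h / range)^2))
}

# Evaluate the three models at selected lags
h <- c(0, 0.1, 0.5, 1, 3)
data.frame(h, 
           Exp = gamma_exp(h), 
           Sph = gamma_sph(h), 
           Gau = gamma_gau(h))
```

At \(h = 0.1\) the Gaussian model is still very close to zero, while at \(h = 1\) only the spherical model has reached the sill. The exponential model reaches about 95% of the sill at three times the range parameter.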

To fit a theoretical semivariogram to the empirical one, the function fit.variogram is used:

# `fit.variogram()` estimates the parameters of the chosen model 
# (here exponential) that best fit the empirical semivariogram
variogram_z.t <- fit.variogram(variogram_z, model = vgm("Exp")) 

The results of which can be plotted after passing the model to the function variogramLine:

# Notice how 'maxdist' is 130, and the model does not exceed that value.
gamma.t <- variogramLine(variogram_z.t, 
                         maxdist = 130) 

# Plot
ggplot(data = variogram_z, 
       aes(x = dist, 
           y = gamma)) +
  geom_point(size = 3) + 
  geom_line(data = gamma.t,
            aes(x = dist, 
                y = gamma)) +
  xlab("Distance") + 
  ylab("Semivariance")

A set of models can be passed as an argument to fit.variogram, in which case the value (output) of the function is the model that provides the best fit to the empirical semivariogram:

variogram_z.t <- fit.variogram(variogram_z, 
                               # Candidate models, passed as a character vector; 
                               # the best fitting one is returned
                               model = vgm(c("Exp", 
                                             "Sph", 
                                             "Gau"))) 
variogram_z.t
##   model     psill    range
## 1   Nug  4045.567  0.00000
## 2   Exp 90703.773 12.52591

In this case, it can be seen that the best fitting model is the exponential, as follows:

gamma.t <- variogramLine(variogram_z.t, 
                         maxdist = 130)

# Plot 
ggplot(data = variogram_z, 
       aes(x = dist, 
           y = gamma)) +
  geom_point(size = 3) + 
  geom_line(data = gamma.t,
            aes(x = dist, 
                y = gamma)) +
  xlab("Distance") + 
  ylab("Semivariance")
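The parameters of the fitted model can be interpreted with a few simple calculations. The sketch below copies the values from the printed output of `variogram_z.t` shown above; the names `fit_nugget`, `fit_psill`, `fit_range`, and `gamma_fit()` are hypothetical, and the exponential form used is assumed to match the parameterization of gstat.

```r
# Values copied from the printed output of `variogram_z.t`
fit_nugget <- 4045.567
fit_psill <- 90703.773
fit_range <- 12.52591

# The total sill is the nugget plus the partial sill
fit_sill <- fit_nugget + fit_psill
fit_sill

# The exponential model approaches the sill asymptotically; a common convention 
# is to report the practical range, the distance at which about 95% of the 
# partial sill is reached (three times the range parameter)
practical_range <- 3 * fit_range
practical_range

# Semivariance according to the model at any lag h > 0
gamma_fit <- function(h) {
  fit_nugget + fit_psill * (1 - exp(-h / fit_range))
}
data.frame(h = c(5, 12.5, 37.5, 100), 
           gamma = gamma_fit(c(5, 12.5, 37.5, 100)))
```

Under these assumptions, observations separated by more than approximately 38 units of distance are only very weakly related.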

For comparison, we will do the variographic analysis of a simulated random dataset.

Generate coordinates for observations and expand on a grid:

# We are generating a regular sequence of coordinates by means of `seq()` 
x <- seq(from = 0, 
         to = 250, 
         by = 10)
y <- seq(from = 0, 
         to = 290, 
         by = 10)

# Create a data frame `df` to store these values
df <- expand.grid(x = x, 
                  y = y)  

Then, create a random variable for these coordinates:

# `set.seed()` is used for replicability: it uses the seed 
# in the argument for generating random numbers
set.seed(100) 
df$z <- rnorm(n = 780, mean = 500, sd = 300)

Finally, convert to a simple features object:

df <- st_as_sf(df, coords = c("x", "y")) 

The empirical variogram is:

# Calculate the variogram
variogram_df <- variogram(z ~ 1, data = df)

# Plot
ggplot(data = variogram_df, aes(x = dist, y = gamma)) +
  geom_point() + 
  geom_text(aes(label = np), nudge_y = -1500) + 
  ylim(c(0, 98100)) +
  xlab("Distance") + 
  ylab("Semivariance")  

The range of the semivariogram appears to be zero, or alternatively, there seems to be a pure nugget effect. This is as expected. Since the data are spatially random, they are not more similar at shorter distances than they would be at longer distances.

36 Activity 17: Spatially Continuous Data III

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

36.1 Practice questions

Answer the following questions:

  1. What is a correlogram?
  2. What is the relationship between the autocovariance and the semivariance?
  3. Describe the elements of a semivariogram.
  4. Why is it important to consider the number of pairs used in the calculation of the semivariance?

36.2 Learning objectives

In this activity, you will:

  1. Calculate and plot empirical semivariograms.
  2. Estimate and plot theoretical semivariograms.
  3. Discuss the results of variographic analysis.

36.3 Suggested reading

  • Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapters 5 and 6. Longman: Essex.
  • Bivand RS, Pebesma E, and Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 8. Springer: New York.
  • Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 6, Sections 6.7 and 6.8. Sage: Los Angeles.
  • Isaaks EH and Srivastava RM (1989) An Introduction to Applied Geostatistics, Chapter 7. Oxford University Press: Oxford.
  • O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapters 9 and 10. John Wiley & Sons: New Jersey.

36.4 Preliminaries

It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently on the workspace.

Load the libraries you will use in this activity (load other packages as appropriate).

library(isdas)
library(gstat)
library(sf)
library(spdep)
library(tidyverse)

Load dataset:

data("aquifer")

Convert to a simple features object:

aquifer.sf <- st_as_sf(aquifer, coords = c("X", "Y"))

The data is a set of piezometric head (watertable pressure) observations of the Wolfcamp Aquifer in Texas (https://en.wikipedia.org/wiki/Hydraulic_head). Measures of pressure can be used to infer the flow of underground water, since water flows from high to low pressure areas.

These data were collected to evaluate potential flow of contamination related to a high level toxic waste repository in Texas. The Deaf Smith county site in Texas was one of three potential sites proposed for this repository. Beneath the site is a deep brine aquifer known as the Wolfcamp aquifer that may serve as a conduit of contamination leaking from the repository.

The data set consists of 85 georeferenced measurements of piezometric head. Possible applications of interpolation are to determine sites at risk and to quantify uncertainty of the interpolated surface, to evaluate the best locations for monitoring stations.

36.5 Activity

NOTE: Activities include technical “how to” tasks/questions. Usually, these ask you to practice using the software to organize data, create plots, and so on in support of analysis and interpretation. The second type of questions ask you to activate your brainware and to think geographically and statistically.

Activity Part I

  1. Obtain and plot the empirical semivariogram for the head in the Wolfcamp Aquifer dataset.

  2. Estimate a trend surface of your choice, and obtain and plot an empirical semivariogram using the residuals. How would you interpret this semivariogram?

  3. Estimate and plot a theoretical semivariogram model for the residual variogram.

Activity Part II

  1. What is your interpretation of the semivariograms above?

  2. How would you use the information provided by the variographic analysis above to improve your predictions (spatially interpolated values) of the field?

37 Spatially Continuous Data IV

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

37.1 Learning objectives

In the previous practice you were introduced to the concept of variographic analysis for fields/spatially continuous data. In this practice, we will learn:

  1. How to use residual spatial pattern to estimate prediction errors.
  2. Kriging: a method for optimal predictions.

37.2 Suggested reading

  • Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapters 5 and 6. Longman: Essex.
  • Bivand RS, Pebesma E, and Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 8. Springer: New York.
  • Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 6, Sections 6.7 and 6.8. Sage: Los Angeles.
  • Isaaks EH and Srivastava RM (1989) An Introduction to Applied Geostatistics, Chapter 12. Oxford University Press: Oxford.
  • O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapters 9 and 10. John Wiley & Sons: New Jersey.

37.3 Preliminaries

As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently on the workspace.

Load the libraries you will use in this activity:

library(isdas)
library(gstat)
library(plotly)
library(sf)
library(spdep)
library(tidyverse)

Begin by loading the data file:

data("Walker_Lake")

You can verify the contents of the dataframe:

summary(Walker_Lake)
##       ID                  X               Y               V                U           T      
##  Length:470         Min.   :  8.0   Min.   :  8.0   Min.   :   0.0   Min.   :   0.00   1: 45  
##  Class :character   1st Qu.: 51.0   1st Qu.: 80.0   1st Qu.: 182.0   1st Qu.:  83.95   2:425  
##  Mode  :character   Median : 89.0   Median :139.5   Median : 425.2   Median : 335.00          
##                     Mean   :111.1   Mean   :141.3   Mean   : 435.4   Mean   : 613.27          
##                     3rd Qu.:170.0   3rd Qu.:208.0   3rd Qu.: 644.4   3rd Qu.: 883.20          
##                     Max.   :251.0   Max.   :291.0   Max.   :1528.1   Max.   :5190.10          
##                                                                      NA's   :195

37.4 Using residual spatial pattern to estimate prediction errors

Previously, in Chapter @ref{spatially-continuous-data-ii} we discussed how to interpolate a field using trend surface analysis; we also saw how that method may lead to residuals that are not spatially independent.

The implication of non-random residuals is that there is systematic residual pattern that the model did not capture. This, in turn, means that there is at least some information that can still be extracted from the residuals. Again, we will use the case of Walker Lake to explore one way to do this.

As before, we first calculate the polynomial terms of the coordinates to fit a trend surface to the data:

Walker_Lake <- mutate(Walker_Lake,
                      X3 = X^3, X2Y = X^2 * Y, X2 = X^2, 
                      XY = X * Y,
                      Y2 = Y^2, XY2 = X * Y^2, Y3 = Y^3)

Given the polynomial expansion, we can proceed to estimate the following cubic trend surface model, which we already know provided the best fit to the data:

WL.trend3 <- lm(formula = V ~ X3 + X2Y + X2 + X + XY + Y + Y2 + XY2 + Y3, 
                data = Walker_Lake)
summary(WL.trend3)
## 
## Call:
## lm(formula = V ~ X3 + X2Y + X2 + X + XY + Y + Y2 + XY2 + Y3, 
##     data = Walker_Lake)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -564.19 -197.41    7.91  194.25  929.72 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.620e+00  1.227e+02  -0.070 0.944035    
## X3           1.533e-04  4.806e-05   3.190 0.001522 ** 
## X2Y          6.139e-05  3.909e-05   1.570 0.117000    
## X2          -6.651e-02  1.838e-02  -3.618 0.000330 ***
## X            9.172e+00  2.386e+00   3.844 0.000138 ***
## XY          -4.420e-02  1.430e-02  -3.092 0.002110 ** 
## Y            4.794e+00  2.040e+00   2.350 0.019220 *  
## Y2          -1.806e-03  1.327e-02  -0.136 0.891822    
## XY2          7.679e-05  2.956e-05   2.598 0.009669 ** 
## Y3          -4.170e-05  2.819e-05  -1.479 0.139759    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.7 on 460 degrees of freedom
## Multiple R-squared:  0.1719, Adjusted R-squared:  0.1557 
## F-statistic: 10.61 on 9 and 460 DF,  p-value: 5.381e-15

We can next visualize the residuals which, as you can see, do not appear to be random:

plot_ly(x = ~Walker_Lake$X, 
        y = ~Walker_Lake$Y, 
        z = ~WL.trend3$residuals, 
        color = ~WL.trend3$residuals < 0, 
        colors = c("blue", "red"), 
        type = "scatter3d")
## No scatter3d mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

Now we will create an interpolation grid:

# The function `seq()` creates a sequence of values `from` - `to` 
# using `by` as the step increment. In this case, we generate a grid
# with points that are 2.5 m apart.
X.p <- seq(from = 0.1, to = 255.1, by = 2.5)
Y.p <- seq(from = 0.1, to = 295.1, by = 2.5)
df.p <- expand.grid(X = X.p, Y = Y.p)

We can add the polynomial terms to this grid. Since our trend surface model was estimated using the cubic polynomial, we add those terms to the dataframe:

df.p <- mutate(df.p, X3 = X^3, X2Y = X^2 * Y, X2 = X^2, 
               XY = X * Y, 
               Y2 = Y^2, XY2 = X * Y^2, Y3 = Y^3)

The interpolated cubic surface is obtained by using the model and the interpolation grid as newdata:

# The function `predict()` is used to make predictions given a model 
# and a possibly new dataset, different from the one used for estimation 
# of the model.
WL.preds3 <- predict(WL.trend3, 
                     newdata = df.p, 
                     se.fit = TRUE, 
                     interval = "prediction", 
                     level = 0.95)

The surface is converted into a matrix for 3D plotting:

z.p3 <- matrix(data = WL.preds3$fit[,1], 
               nrow = length(Y.p), 
               ncol = length(X.p), 
               byrow = TRUE)
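To see why the matrix is filled by row, with one row for each value of `Y.p`, consider the following small sketch. The objects `x_demo`, `y_demo`, `grid_demo`, and `m_demo` are hypothetical and used only for illustration.

```r
# A small grid: three values of x and two values of y
x_demo <- c(1, 2, 3)
y_demo <- c(10, 20)

# `expand.grid()` varies the first coordinate fastest: 
# (1, 10), (2, 10), (3, 10), (1, 20), (2, 20), (3, 20)
grid_demo <- expand.grid(X = x_demo, Y = y_demo)

# A value that identifies each location
grid_demo$val <- grid_demo$X + grid_demo$Y

# Filling the matrix by row places each value of y in a row 
# and each value of x in a column
m_demo <- matrix(data = grid_demo$val, 
                 nrow = length(y_demo), 
                 ncol = length(x_demo), 
                 byrow = TRUE)
m_demo
```

In the resulting matrix, element \([i, j]\) corresponds to the \(i\)th value of \(y\) and the \(j\)th value of \(x\), the arrangement used above for the surface plot.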

And plot:

WL.plot3 <- plot_ly(x = ~X.p, 
                    y = ~Y.p, 
                    z = ~z.p3, 
                    type = "surface", 
                    colors = "YlOrRd") %>% 
  layout(scene = list(aspectmode = "manual",
                      aspectratio = list(x = 1, 
                                         y = 1,
                                         z = 1)))
WL.plot3

The trend surface provides a smooth estimate of the field. However, it is not sufficient to capture all systematic variation, and fails to produce random residuals.

A possible way of enhancing this approach to interpolation is to exploit the information that remains in the residuals, for instance by the use of \(k\)-point means.

We can illustrate this as follows. To interpolate the residuals, we first need the set of target points (the points for the interpolation), as well as the source (the observations):

# We will use the prediction grid we used above to interpolate the residuals 
target_xy = expand.grid(x = X.p, 
                        y = Y.p) %>%
  st_as_sf(coords = c("x", "y"))

# Convert the `Walker_Lake` dataframe to a simple features object as follows:
Walker_Lake.sf <- Walker_Lake %>% 
  st_as_sf(coords = c("X", "Y"))

# Append the residuals to the table
Walker_Lake.sf$residuals <- WL.trend3$residuals

It is possible now to use the kpointmean function to interpolate the residuals, for instance using \(k=5\) neighbors:

kpoint.5 <- kpointmean(source_xy = Walker_Lake.sf, 
                       target_xy = target_xy, 
                       z = residuals, 
                       k = 5)
## projected points

Given the interpolated residuals, we can join them to the cubic trend surface, as follows:

z.p3 <- matrix(data = WL.preds3$fit[,1] + kpoint.5$z,
               nrow = length(Y.p), 
               ncol = length(X.p), 
               byrow = TRUE)

This is now the interpolated field that combines the trend surface and the estimated residuals:

WL.plot3 <- plot_ly(x = ~X.p, 
                    y = ~Y.p, 
                    z = ~z.p3,
                    type = "surface", 
                    colors = "YlOrRd") %>% 
  layout(scene = list(aspectmode = "manual",
                      aspectratio = list(x = 1, 
                                         y = 1, 
                                         z = 1)))
WL.plot3

Of all the approaches that we have seen so far, this is the first that provides a genuine estimate of the following: \[ \hat{z}_p + \hat{\epsilon}_p \]

With trend surface analysis providing a smooth estimator of the underlying field: \[ \hat{z}_p = f(x_p, y_p) \]

And \(k\)-point means providing an estimator of: \[ \hat{\epsilon}_p \]

A question is how to decide the number of neighbors to use in the calculation of the \(k\)-point means. As previously discussed, \(k = 1\) becomes identical to Voronoi polygons, and \(k = n\) becomes the global mean.
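
One possible way to approach this question is leave-one-out cross-validation: each observation is predicted from the remaining ones, and the value of \(k\) with the smallest prediction error is retained. The sketch below is only an illustration under stated assumptions: it uses simulated data rather than the Walker Lake sample, and `knn_mean()` is a hypothetical base R helper written for this example, not the `kpointmean()` function used above.

```r
# Hypothetical helper: mean of the k nearest observations to each target
knn_mean <- function(src_xy, src_z, tgt_xy, k) {
  apply(tgt_xy, 1, function(p) {
    d <- sqrt((src_xy[, 1] - p[1])^2 + (src_xy[, 2] - p[2])^2)
    mean(src_z[order(d)[1:k]])
  })
}

# Simulated observations (not the Walker Lake data)
set.seed(123)
n <- 100
xy <- cbind(x = runif(n), y = runif(n))
z <- sin(4 * xy[, 1]) + cos(4 * xy[, 2]) + rnorm(n, sd = 0.2)

# Leave-one-out root mean squared error for a range of values of k
k.values <- 1:15
loo.rmse <- sapply(k.values, function(k) {
  pred <- sapply(seq_len(n), function(i) {
    knn_mean(xy[-i, , drop = FALSE], z[-i], xy[i, , drop = FALSE], k)
  })
  sqrt(mean((z - pred)^2))
})

# The value of k with the smallest leave-one-out error
k.values[which.min(loo.rmse)]
```

Note that with \(k = n\) the helper returns the global mean at every target, in line with the limiting case mentioned above.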

A second question concerns the way the average is calculated. As variographic analysis demonstrates, it is possible to estimate the way in which spatial dependence weakens with distance. Why should more distant points receive the same weight as nearby ones? The answer is that there is no reason why they should. In fact, variographic analysis elegantly solves this, as well as the question of how many points to use: all of them, with varying weights.

Next, we will introduce kriging, a method for optimal prediction that is based on the use of variographic analysis.

37.5 Kriging: a method for optimal prediction.

To introduce the method known as kriging, we will begin by positing a situation as follows:

\[ \hat{z}_p + \hat{\epsilon}_p = \hat{f}(x_p, y_p) + \hat{\epsilon}_p \]

where \(\hat{f}(x_p, y_p)\) is a smooth estimator of an underlying field.

We aim to predict \(\hat{\epsilon}_p\) based on the observed residuals. We use an expression similar to the one used for IDW and \(k\)-point means in Chapter @ref{spatially-continuous-data-i} (we will use \(\lambda\) for the weights to avoid confusion with the weights used in variographic analysis):

\[ \hat{\epsilon}_p = \sum_{i=1}^n {\lambda_{pi}\epsilon_i} \]

That is, \(\hat{\epsilon}_p\) is a linear combination of the prediction residuals from the trend: \[ \epsilon_i = z_i - \hat{f}(x_i, y_i) \]

It is possible to define the following expected mean squared error, or prediction variance: \[ \sigma_{\epsilon}^2 = E[(\hat{\epsilon}_p - \epsilon_p)^2] \]

The prediction variance measures how close, on average, the predicted residual \(\hat{\epsilon}_p\) is to the true (unobserved) residual \(\epsilon_p\) at location \(p\).

The prediction variance can be decomposed as follows: \[ \sigma_{\epsilon}^2 = E[\hat{\epsilon}_p^2] + E[\epsilon_p^2] - 2E[\hat{\epsilon}_p\epsilon_p] \]

It turns out (we will not show the detailed derivation, but it can be consulted here), that the expression for the prediction variance depends on the weights: \[ \sigma_{\epsilon}^2 = \sum_{i=1}^n \sum_{j=1}^n{\lambda_{pi}\lambda_{pj}C_{ij}} + \sigma^2 - 2\sum_{i=1}^{n}{\lambda_{pi}C_{ip}} \] where \(C_{ij}\) is the autocovariance between observations at \(i\) and \(j\), \(C_{ip}\) is the autocovariance between the observation at \(i\) and prediction location \(p\), and \(\sigma^2\) is the variance of the residuals.

Fortunately for us, the relationship between the semivariogram and the autocovariance is straightforward: \[ C_{z}(h) =\sigma^2 - \hat{\gamma}_{z}(h) \]
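
As a minimal sketch of this relationship, suppose the semivariogram follows an exponential model without a nugget. The model form, the parameter values, and the names `gamma_exp()` and `cov_exp()` are assumptions made for illustration only:

```r
# Assumed exponential semivariogram with no nugget (illustrative values)
sigma2 <- 10   # sill, that is, the variance of the process
a <- 5         # range parameter
gamma_exp <- function(h) sigma2 * (1 - exp(-h / a))

# Autocovariance implied by the semivariogram: C(h) = sigma^2 - gamma(h)
cov_exp <- function(h) sigma2 - gamma_exp(h)

# Semivariance grows with distance while autocovariance decays
h <- c(0, 1, 5, 15, 50)
data.frame(h = h,
           semivariance = gamma_exp(h),
           autocovariance = cov_exp(h))
```

At distance zero the autocovariance equals the variance, and at large distances it approaches zero as the semivariance approaches the sill.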

This means that, given the distance \(h\) between \(i\) and \(j\), and \(i\) and \(p\), we can use a semivariogram to obtain the autocovariances needed to calculate the prediction variance. We are still missing, however, the weights \(\lambda\), which are not known a priori.

These weights can be obtained if we use the following rules:

  1. The expectation of the prediction errors is zero (unbiasedness).
  2. The weights \(\lambda\) minimize the prediction variance (optimal estimator).

This makes sense, since we would like our predictions to be unbiased (i.e., accurate) and as precise as possible, that is, to have the smallest variance (recall the discussion about accuracy and precision in Chapter @ref{spatially-continuous-data-iii}).

Again, solving the minimization problem is beyond the scope of our presentation, but it suffices to say that the result is as follows:

\[ \boldsymbol{\lambda}_p = \mathbf{C}^{-1}\mathbf{c}_{p} \]

where \(\mathbf{C}\) is the covariance matrix of the observations, and \(\mathbf{c}_{p}\) is the vector of covariances between the observations and location \(p\).
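
The following sketch works through this result numerically for three observations and one prediction location. Everything in it is assumed for illustration: the exponential autocovariance and its parameters, the coordinates, and the residual values. It computes the weights from the covariance matrix and the covariance vector, and then evaluates the prediction variance at those weights:

```r
# Assumed exponential autocovariance (illustrative values, no nugget)
sigma2 <- 10
a <- 5
cov_exp <- function(h) sigma2 * exp(-h / a)

# Three observation sites and one prediction location p (hypothetical)
obs <- cbind(x = c(0, 4, 10), y = c(0, 3, 0))
p <- c(x = 3, y = 1)
e <- c(1.2, 0.8, -0.5)  # residuals observed at the three sites

# Covariance matrix among observations and covariance vector to p
C <- cov_exp(as.matrix(dist(obs)))
c.p <- cov_exp(sqrt((obs[, 1] - p[1])^2 + (obs[, 2] - p[2])^2))

# Weights obtained by solving C lambda = c_p
lambda <- solve(C, c.p)

# Prediction variance as a function of the weights
pred_var <- function(l) drop(t(l) %*% C %*% l + sigma2 - 2 * sum(l * c.p))

# Predicted residual at p and its prediction variance
e.hat <- sum(lambda * e)
c(prediction = e.hat, variance = pred_var(lambda))
```

Evaluating `pred_var()` at any other set of weights, for example `lambda + c(0.05, -0.02, 0.01)`, gives a larger value, which is what it means for these weights to be optimal.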

In summary, kriging is a method to optimally estimate the value of a variable at \(p\) as a weighted sum of the observations of the same variable at locations \(i\). This method is known to have the properties of a Best (in the sense that it minimizes the prediction variance) Linear (because predictions are a linear combination of the observations) Unbiased (since the expected value of the prediction errors is zero) Predictor, or BLUP.

Kriging is implemented in the package gstat as follows.

To put kriging to work we must first conduct variographic analysis of the residuals. The function variogram uses as an argument a simple features object that we can create as follows:

Walker_Lake.sf <- Walker_Lake %>%
  st_as_sf(coords = c("X", "Y"),
           # Remove set to false to retain the X and Y coordinates 
           # in the dataframe after they are converted to simple features
           remove = FALSE) 

The variogram of the residuals can be obtained by specifying a trend surface in the formula:

variogram_v <- variogram(V ~ X3 + X2Y + X2 + X + XY + Y + Y2 + XY2 + Y3, 
                         data = Walker_Lake.sf)

# Plot 
ggplot(data = variogram_v, 
       aes(x = dist, 
           y = gamma)) +
  geom_point() + 
  geom_text(aes(label = np), 
            # Nudge the labels away from the points
            nudge_y = -1500) +
  xlab("Distance") +
  ylab("Semivariance")

You can verify that the semivariogram above corresponds to the residuals by repeating the analysis directly on the residuals. First join the residuals to the simple features object:

Walker_Lake.sf$e <- WL.trend3$residuals

And then calculate the semivariogram and plot:

variogram_e <- variogram(e ~ 1, 
                         data = Walker_Lake.sf)

# Plot 
ggplot(data = variogram_e, 
       aes(x = dist, 
           y = gamma)) +
  geom_point() + 
  geom_text(aes(label = np), 
            nudge_y = -1500) +
  xlab("Distance") + 
  ylab("Semivariance")

The empirical semivariogram is used to estimate a semivariogram function:

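# Note: to let `fit.variogram()` choose among several candidate models, pass
# them as one character vector, e.g., vgm(c("Exp", "Sph", "Gau")). Given as
# separate arguments, only the first ("Exp") is read as the model type.
variogram_v.t <- fit.variogram(variogram_v, model = vgm("Exp"))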
variogram_v.t
##   model   psill    range
## 1   Nug     0.0 0.000000
## 2   Exp 85554.4 9.910429
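
To interpret these parameters, the sketch below evaluates the fitted model by hand. It assumes that the exponential model is parameterized as \(\gamma(h) = c(1 - e^{-h/a})\), with partial sill \(c\) and range parameter \(a\), which is our understanding of the convention in gstat. The names `psill`, `range.par`, and `gamma_fit()` are hypothetical:

```r
# Parameters reported by `fit.variogram()` above
psill <- 85554.4
range.par <- 9.910429

# Exponential semivariogram (the nugget is zero in this case):
# gamma(h) = psill * (1 - exp(-h / range))
gamma_fit <- function(h) psill * (1 - exp(-h / range.par))

# Fraction of the sill reached at selected distances
h <- c(5, range.par, 3 * range.par, 50)
round(gamma_fit(h) / psill, 3)
```

Under this parameterization the semivariance reaches about 95% of the sill at three times the range parameter, so spatial dependence in the residuals is practically exhausted at a distance of roughly 30 units.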

The variogram function plots as follows:

gamma.t <- variogramLine(variogram_v.t, maxdist = 130)

# Plot
ggplot(data = variogram_v,
       aes(x = dist, 
           y = gamma)) +
  geom_point(size = 3) + 
  geom_line(data = gamma.t,
            aes(x = dist, 
                y = gamma)) +
  xlab("Distance") + 
  ylab("Semivariance")

We will convert the prediction grid to a simple features object:

df.sf <- df.p %>%
  st_as_sf(coords = c("X", "Y"),
           remove = FALSE)

Then, we can krige the field as follows (ensure that packages sf and stars are installed):

V.kriged <- krige(V ~ X3 + X2Y + X2 + X + XY + Y + Y2 + XY2 + Y3,
                  Walker_Lake.sf, 
                  df.sf, 
                  variogram_v.t)
## [using universal kriging]

Extract the predictions and prediction variance from the object V.kriged:

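# The grid has length(Y.p) = 119 rows and length(X.p) = 103 columns
V.km <- matrix(data = V.kriged$var1.pred,
               nrow = length(Y.p),
               ncol = length(X.p), 
               byrow = TRUE)
V.sm <- matrix(data = V.kriged$var1.var,
               nrow = length(Y.p),
               ncol = length(X.p),
               byrow = TRUE)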

We can now plot the interpolated field:

V.km.plot <- plot_ly(x = ~X.p,
                     y = ~Y.p, 
                     z = ~V.km, 
                     type = "surface", 
                     colors = "YlOrRd") %>% 
  layout(scene = list(aspectmode = "manual", 
                      aspectratio = list(x = 1,
                                         y = 1,
                                         z = 1)))
V.km.plot

Also, we can plot the kriging standard errors (the square root of the prediction variance). This gives an estimate of the uncertainty in the predictions:

V.sm.plot <- plot_ly(x = ~X.p, 
                     y = ~Y.p,
                     z = ~sqrt(V.sm), 
                     type = "surface", 
                     colors = "YlOrRd") %>% 
  layout(scene = list(aspectmode = "manual",
                      aspectratio = list(x = 1, 
                                         y = 1, 
                                         z = 1)))
V.sm.plot

Where are predictions more/less precise?

38 Activity 18: Spatially Continuous Data IV

NOTE: The source files for this book are available with companion package {isdas}. The source files are in Rmarkdown format and packed as templates. These files allow you to execute code within the notebook, so that you can work interactively with the notes.

38.1 Practice questions

Answer the following questions:

  1. What does “Best” in BLUP mean?
  2. What is the advantage of kriging over other interpolation approaches?
  3. How is the autocovariance used to produce optimal predictions?

38.2 Learning objectives

In this activity, you will:

  1. Conduct variographic analysis.

  2. Use kriging to interpolate a field.

38.3 Suggested reading

  • Bailey TC and Gatrell AC (1995) Interactive Spatial Data Analysis, Chapters 5 and 6. Longman: Essex.
  • Bivand RS, Pebesma E, and Gomez-Rubio V (2008) Applied Spatial Data Analysis with R, Chapter 8. Springer: New York.
  • Brunsdon C and Comber L (2015) An Introduction to R for Spatial Analysis and Mapping, Chapter 6, Sections 6.7 and 6.8. Sage: Los Angeles.
  • Isaaks EH and Srivastava RM (1989) An Introduction to Applied Geostatistics, Chapter 12. Oxford University Press: Oxford.
  • O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapters 9 and 10. John Wiley & Sons: New Jersey.

38.4 Preliminaries

It is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:

rm(list = ls())

Note that ls() lists all objects currently in the workspace.

Load the libraries you will use in this activity (load other packages as appropriate).

library(isdas)
library(gstat)
library(sf)
library(tidyverse)

Load dataset:

data("aquifer")

The data is a set of piezometric head (watertable pressure) observations of the Wolfcamp Aquifer in Texas (https://en.wikipedia.org/wiki/Hydraulic_head). Measures of pressure can be used to infer the flow of underground water, since water flows from high to low pressure areas.

These data were collected to evaluate potential flow of contamination related to a high level toxic waste repository in Texas. The Deaf Smith county site in Texas was one of three potential sites proposed for this repository. Beneath the site is a deep brine aquifer known as the Wolfcamp aquifer that may serve as a conduit of contamination leaking from the repository.

The data set consists of 85 georeferenced measurements of piezometric head. Possible applications of interpolation are to determine sites at risk and to quantify uncertainty of the interpolated surface, to evaluate the best locations for monitoring stations.

Convert to a simple features object:

aquifer.sf <- aquifer %>%
  st_as_sf(coords = c("X", "Y"),
           remove = FALSE)

38.5 Activity

Capstone Activity

This is a capstone activity where you can work free-style on a data set and put into practice what you have learned with respect to the analysis of spatially continuous (field) data.

  1. Partner with a fellow student to analyze the dataset provided.

  2. Use kriging to interpolate the underlying field. Justify your modeling choices.

  3. Discuss your results.

  4. Imagine that you had to compare different modeling approaches (e.g., kriging, IDW). Propose a protocol to decide which method is more accurate.

Anselin, Luc. 1988. Spatial Econometrics: Methods and Models. Dordrecht: Kluwer.
Anselin, Luc. 1995. “Local Indicators of Spatial Association - LISA.” Geographical Analysis 27: 93–115.
Baddeley, Adrian, Ege Rubak, and Rolf Turner. 2016. Spatial Point Patterns: Methodology and Applications with R. Chapman & Hall/CRC.
Bailey, T. C., and A. C. Gatrell. 1995. Interactive Spatial Data Analysis. Essex: Addison Wesley Longman.
Bivand, R. S., E. J. Pebesma, and V. Gómez-Rubio. 2008. Applied Spatial Data Analysis with R. New York: Springer Science+Business Media.
Brunsdon, C., A. S. Fotheringham, and M. E. Charlton. 1996. “Geographically Weighted Regression: A Method for Exploring Spatial Nonstationarity.” Geographical Analysis 28 (4): 281–98.
Brunsdon, Chris, and Lex Comber. 2015. An Introduction to R for Spatial Analysis and Mapping. Sage.
Cressie, N. A. C. 1993. Statistics for Spatial Data. Wiley Series in Probability and Mathematical Statistics. New York: John Wiley & Sons.
Farber, S., and A. Páez. 2007. “A Systematic Investigation of Cross-Validation in GWR Model Estimation: Empirical Analysis and Monte Carlo Simulations.” Journal of Geographical Systems 9 (4): 371–96.
Fotheringham, A. S., and C. Brunsdon. 1999. “Local Forms of Spatial Analysis.” Geographical Analysis 31 (4): 340–58.
Getis, A., and J. K. Ord. 1992. “The Analysis of Spatial Association by Use of Distance Statistics.” Geographical Analysis 24 (3): 189–206.
Geyer, Charles J, and Jesper Møller. 1994. “Simulation Procedures and Likelihood Inference for Spatial Point Processes.” Scandinavian Journal of Statistics, 359–73.
Griffith, D. A. 1988. Advanced Spatial Statistics: Special Topics in the Exploration of Quantitative Spatial Data Series. Dordrecht: Kluwer.
Haase, P. 1995. “Spatial Pattern Analysis in Ecology Based on Ripley’s K-Function: Introduction and Methods of Edge Correction.” Journal of Vegetation Science 6 (4): 575–82.
Haining, R. 1990. Spatial Data Analysis in the Social and Environmental Sciences. Cambridge: Cambridge University Press.
Isaaks, E. H., and R. M. Srivastava. 1989. Applied Geostatistics. New York: Oxford University Press.
Lloyd, Christopher D. 2010. Local Models for Spatial Analysis. CRC Press.
Lovelace, Robin, Jakub Nowosad, and Jannes Muenchow. 2019. Geocomputation with R. CRC Press.
McElreath, Richard. 2016. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Vol. 122. CRC Press.
McGrew Jr, J Chapman, and Charles B Monroe. 2009. An Introduction to Statistical Problem Solving in Geography. 2nd Edition. Long Grove, Illinois: Waveland Press.
McMillen, D. P. 2003. “Spatial Autocorrelation or Model Misspecification?” International Regional Science Review 26 (2): 208–17.
Møller, Jesper, and Rasmus Plenge Waagepetersen. 2003. Statistical Inference and Simulation for Spatial Point Processes. Chapman & Hall/CRC.
O’Sullivan, David, and David Unwin. 2010. Geographic Information Analysis. 2nd Edition. Hoboken, New Jersey: John Wiley & Sons.
Paez, A., S. Farber, and D. Wheeler. 2011. “A Simulation-Based Study of Geographically Weighted Regression as a Method for Investigating Spatially Varying Relationships.” Environment and Planning A 43 (12): 2992–3010. https://doi.org/10.1068/a44111.
Paez, A., F. Long, and S. Farber. 2008. “Moving Window Approaches for Hedonic Price Estimation: An Empirical Comparison of Modelling Techniques.” Urban Studies 45 (8): 1565–81. https://doi.org/10.1177/0042098008091491.
Plant, Richard E. 2012. Spatial Data Analysis in Ecology and Agriculture Using R. CRC Press.
Rey, S. J. 2009. “Show Me the Code: Spatial Analysis and Open Source.” Journal of Geographical Systems 11 (2): 191–207.
Ripley, B. D. 1976. “2nd-Order Analysis of Stationary Point Processes.” Journal of Applied Probability 13 (2): 255–66.
Tomlin, C Dana. 1990. A Map Algebra. Cambridge, MA: Harvard Graduate School of Design.
Tong, Daoqin, and Alan T. Murray. 2012. “Spatial Optimization in Geography.” Annals of the Association of American Geographers 102 (6): 1290–1309. https://doi.org/10.1080/00045608.2012.685044.
Wickham, Hadley. 2015. R Packages: Organize, Test, Document, and Share Your Code. O’Reilly Media, Inc.
———. 2017. Tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse.
Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc.
With, Kimberly A, and Anthony W King. 1997. “The Use and Misuse of Neutral Landscape Models in Ecology.” Oikos, 219–29.